Automatic coalescing of GPU-initiated network communication

ABSTRACT

Apparatuses, systems, and techniques are directed to automatic coalescing of GPU-initiated network communications. In one method, a communication engine receives, from a shared memory application executing on a first graphics processing unit (GPU), a first communication request assigned to or having a second GPU as a destination to be processed. The communication engine determines that the first communication request satisfies a coalescing criterion and stores the first communication request in association with a group of requests that have a common property. The communication engine coalesces the group of requests into a coalesced request and transports the coalesced request to the second GPU over a network.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under Agreement H98230-16-3-0001 awarded by the U.S. Department of Defense (DoD). The government has certain rights in the invention.

TECHNICAL FIELD

At least one embodiment pertains to processing resources used to perform and facilitate network communication. For example, at least one embodiment pertains to technology for automatic coalescing of GPU-initiated network communication.

BACKGROUND

Data movement requests initiated by graphics processing unit (GPU) threads are typically fine-grain requests. For example, a GPU thread typically initiates a communication request (e.g., put/get/atomic requests) for a single data element. These communication requests can be sent between distributed GPUs over peer-to-peer (P2P) connections or network connections. Communication requests with small payloads can be transmitted efficiently over P2P connections. However, communication requests with small payloads sent over a network between GPUs can result in low network efficiency and poor performance.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 is a network diagram of a GPU having a communication engine with a coalescing agent, in accordance with at least some embodiments;

FIG. 2 is a block diagram of a communication engine with a coalescing agent, in accordance with at least some embodiments;

FIG. 3 is a flow diagram of a method for coalescing a group of communication requests into a coalesced request, in accordance with at least some embodiments;

FIG. 4 is a network diagram of a computer system having multiple GPUs coupled via a network and peer-to-peer (P2P) connections, the GPUs each including a communication engine with a coalescing agent, in accordance with at least some embodiments;

FIG. 5 is a network diagram of a first GPU coupled to a second GPU having a coalescing agent, in accordance with at least some embodiments;

FIG. 6 is a network diagram of a first GPU coupled to a CPU having a coalescing agent, in accordance with at least some embodiments;

FIG. 7 is a network diagram of a first GPU coupled to a hardware offload circuit having a coalescing agent, in accordance with at least some embodiments;

FIG. 8 is a block diagram of a computing device having a GPU with a coalescing agent, in accordance with at least some embodiments;

FIG. 9 illustrates an example data center system, according to at least one embodiment;

FIG. 10A illustrates an example of an autonomous vehicle, according to at least one embodiment;

FIG. 10B illustrates an example of camera locations and fields of view for the autonomous vehicle of FIG. 10A, according to at least one embodiment;

FIG. 10C illustrates an example system architecture for the autonomous vehicle of FIG. 10A, according to at least one embodiment;

FIG. 10D illustrates a system for communication between cloud-based server(s) and the autonomous vehicle of FIG. 10A, according to at least one embodiment; and

FIG. 11 illustrates a computer system, according to at least one embodiment.

DETAILED DESCRIPTION

As described above, data movement requests between distributed GPUs over a network can have smaller payloads that result in low network efficiency and poor performance. Some GPU architectures can include coalescing mechanisms to combine multiple data transfer requests (e.g., load/store operations) into a single request that can be transmitted efficiently via a P2P link (e.g., NVLink). When data transfers are performed via a network (e.g., InfiniBand or Ethernet), these coalescing mechanisms are not leveraged, resulting in low network efficiency and poor performance. For example, one solution performs automatic warp-level coalescing using software techniques to detect requests originating from multiple threads in a warp (also referred to as a group of threads) and coalesces these requests. While this technique effectively detects requests for coalescing, the resulting packet sizes are still typically 8 B × 32 threads = 256 B data payloads, which are still inefficient for transfer over modern high-speed networks. Moreover, warp-level coalescing can only coalesce requests associated with contiguous memory locations.

Aspects and embodiments of the present disclosure address these and other challenges by providing a coalescing agent (software or hardware logic) in a communication engine that analyzes communication requests horizontally across multiple groups of threads (warps or cooperative thread arrays (CTAs)), as well as vertically across multiple requests issued by each thread, including requests associated with non-contiguous memory locations, before transporting the communication requests over a network. Aspects and embodiments of the present disclosure can achieve a greater degree of coalescing than prior approaches.

In at least one embodiment, the communication engine can use a GPU companion kernel that services both fine-grain communication requests (e.g., shared memory requests such as NVSHMEM requests) for direct P2P data transfers (e.g., NVLink) and communication requests for network transport. The direct P2P transfer requests bypass the coalescing agent and are issued directly on the P2P link since the P2P link already provides good efficiency for small data transfers. The coalescing agent processes the network transfer requests before network transport since they cannot be serviced using direct P2P transfers. In at least one embodiment, for communication requests over the network, the coalescing agent stores the communication requests in request queues to be analyzed for coalescing before being sent over the network. In at least one embodiment, a communication engine with a coalescing agent receives, from a shared memory application executing on a first GPU, a first communication request initially assigned to (to be processed by) a second GPU. The communication engine determines that the first communication request satisfies a coalescing criterion and stores the first communication request in association with a group of requests that have a common property. The coalescing agent determines that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion and coalesces the group of requests into a coalesced request. The communication engine transports the coalesced request to the second GPU over a network.
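The flow just described — group requests that share a common property, bound the added latency with a per-group timer, and flush a group once it reaches a target size — can be sketched in ordinary C++. This is a minimal illustration only: the `Request` and `Group` types, the thresholds, and the map-based grouping are assumptions for exposition, not structures taken from the patent, and the actual network send is elided.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Illustrative descriptor for one fine-grain GPU-initiated operation.
enum class Op { Put, Get, Atomic };

struct Request {
    Op op;
    int dest_gpu;                       // destination GPU of the request
    std::uint64_t remote_addr;          // remote memory address
    std::vector<std::uint8_t> payload;  // data for put/atomic operations
};

struct Group {
    std::vector<Request> requests;
    std::size_t bytes = 0;
    std::chrono::steady_clock::time_point opened;  // per-group timer start
};

class CoalescingAgent {
public:
    // Called for each request that already satisfied the coalescing criterion.
    void enqueue(const Request& r) {
        Group& g = groups_[{r.dest_gpu, r.op}];  // common property: destination + op type
        if (g.requests.empty())                  // new (or just-flushed) group:
            g.opened = std::chrono::steady_clock::now();  // reset its timer
        g.bytes += r.payload.size();
        g.requests.push_back(r);
        if (g.bytes >= kMinTransferBytes)        // group size criterion met
            flush(g);
    }

    // Periodically scan all groups and flush any whose timer has expired.
    void poll() {
        auto now = std::chrono::steady_clock::now();
        for (auto& [key, g] : groups_)
            if (!g.requests.empty() && now - g.opened >= kMaxLatency)
                flush(g);
    }

private:
    static constexpr std::size_t kMinTransferBytes = 4096;       // illustrative threshold
    static constexpr std::chrono::microseconds kMaxLatency{50};  // illustrative latency bound

    void flush(Group& g) {
        // Coalesce g.requests into one message and hand it to the transport
        // engine; the send itself is elided in this sketch.
        g = Group{};
    }

    std::map<std::pair<int, Op>, Group> groups_;
};
```

In this sketch, `enqueue` would be fed by the transmit-side routing logic and `poll` would run on the companion kernel's service loop; a hardware implementation could replace the map with an associative lookup table.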

Aspects and embodiments of the present disclosure can analyze communication requests at a much broader scope (e.g., across multiple warps/CTAs) and across multiple requests issued by a single thread, and can coalesce communication requests having a common property before transporting them over a network. Aspects and embodiments of the present disclosure can coalesce communication requests associated with noncontiguous memory locations.

FIG. 1 is a network diagram of a GPU 100 having a communication engine 102 with a coalescing agent 104, in accordance with at least some embodiments. The GPU 100 can be used to perform various operations, including speech recognition, object recognition, or any inferencing operations involving machine learning. In some embodiments, GPU 100 includes multiple cores, and each core is capable of executing multiple threads. Each core may run multiple threads concurrently (e.g., in parallel). In some embodiments, threads may have access to registers. Registers may be thread-specific registers, with access to a register restricted to a respective thread. Additionally, shared registers may be accessed by all threads of the core. In some embodiments, a core may include a scheduler to distribute computational tasks and processes among different threads of the core. A dispatch unit may implement scheduled tasks on appropriate threads using the correct private registers and shared registers.

In some embodiments, the GPU 100 may have a (high-speed) cache, access to which may be shared by multiple cores. Furthermore, the GPU 100 can be associated with a GPU memory where the GPU 100 may store intermediate and/or final results (outputs) of various computations performed by the GPU 100. After completing a particular task, the GPU 100 may move the output to (main) memory. In some embodiments, the CPU may execute processes involving serial computational tasks (assigned by one of the pipeline engines). In contrast, the GPU 100 may execute tasks (such as multiplication of inputs of a neural node by weights and adding biases) that are amenable to parallel processing. In some embodiments, a respective engine of the pipeline (e.g., an inference engine) may determine which processes managed by the respective engine are to be executed on the GPU 100 and which processes are to be executed on a CPU (not illustrated in FIG. 1). In some embodiments, the CPU may determine which processes are to be executed on GPU 100 and which processes are to be executed on the CPU.

In at least one embodiment, as illustrated in FIG. 1, the GPU 100 can execute a shared memory application 106. There are instances where the GPU 100 needs to communicate with a second GPU 110. To communicate with the second GPU 110, the GPU 100 sends communication requests 101 to the communication engine 102, which includes a coalescing agent 104. The coalescing agent 104 can analyze communication requests horizontally across a single thread, across a group of threads, or across multiple groups of threads (warps or CTAs), as well as vertically across multiple requests issued by each thread, including requests associated with non-contiguous memory locations, before transporting the communication requests 101 over a network 116. The coalescing agent 104 can achieve a greater degree of coalescing than prior approaches that perform warp-level coalescing for communication requests associated with contiguous memory locations.

During operation, the communication engine 102 can receive a communication request initially assigned to and/or having the second GPU 110 as a destination for processing. The communication engine 102 determines whether the first communication request satisfies a coalescing criterion. For example, the communication engine 102 can determine whether the first communication request satisfies a request size criterion. The request size criterion can be a threshold size above which requests do not need coalescing. If the first communication request is less than the threshold size, then the communication engine 102 determines that the first communication request satisfies the request size criterion. Alternatively, the communication engine 102 can determine whether the first communication request satisfies a latency criterion. In at least one embodiment, the latency criterion can be set to allow coalescing so long as the coalesced communication request still meets a specified latency requirement. Alternatively, the communication engine 102 can determine whether the first communication request satisfies a P2P connectivity criterion. In at least one embodiment, the P2P connectivity criterion can require that the communication request be sent via a network connection and not a P2P connection between the GPU 100 and the second GPU 110. If the first communication request needs to be sent via the P2P connection, the first communication request should not be coalesced and can fail to satisfy the P2P connectivity criterion. In at least one embodiment, the communication engine 102 can determine which P2P connections are available and determine whether the communication requests 101 can reach their respective destinations over the P2P connections.
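As a concrete illustration of these checks, a transmit-side routing decision might look like the following sketch. The threshold value and the field names are assumptions for exposition; the patent does not specify them.

```cpp
#include <cstddef>

// Illustrative routing outcome for one communication request.
enum class Route { P2PDirect, NetworkDirect, Coalesce };

struct RequestInfo {
    std::size_t size_bytes;  // payload size of the request
    bool latency_sensitive;  // request must meet a latency requirement
    bool p2p_reachable;      // destination reachable over a P2P connection
};

constexpr std::size_t kNoCoalesceBytes = 2048;  // assumed request size threshold

Route route(const RequestInfo& r) {
    if (r.p2p_reachable)                   // P2P connectivity criterion not met:
        return Route::P2PDirect;           // small transfers are efficient on P2P
    if (r.size_bytes >= kNoCoalesceBytes)  // request size criterion not met:
        return Route::NetworkDirect;       // already large enough for the network
    if (r.latency_sensitive)               // latency criterion not met:
        return Route::NetworkDirect;       // coalescing would add queueing delay
    return Route::Coalesce;                // small, latency-tolerant, network-bound
}
```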

In at least one embodiment, responsive to a determination that the first communication request satisfies the coalescing criterion, the coalescing agent 104 stores the first communication request in association with a group of requests where each request has a common property. In at least one embodiment, the coalescing agent 104 can store the first communication request in an associative data structure where the first request is appended to the group of requests that have the common property. In at least one embodiment, the common property is the same type of operation (e.g., read, write, or atomic operation). In at least one embodiment, the common property is the same network destination (e.g., the same memory or network address). In at least one embodiment, the common property is the same GPU destination. In at least one embodiment, the common property is adjacent memory locations (e.g., adjacent memory locations at which data should be read according to communication requests 101 or to which data should be written according to communication requests 101). Alternatively, other common properties can be used to group communication requests 101 into a coalesced request 103 for efficient transport over the network 116.
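The adjacent-memory-locations case lends itself to a simple illustration: requests whose address ranges abut can be merged into contiguous spans before transport. The sketch below assumes an illustrative `Span` type; it is not the patent's data structure.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One request's target address range: [addr, addr + len).
struct Span {
    std::uint64_t addr;
    std::uint64_t len;
};

// Merge requests targeting adjacent remote addresses into contiguous spans,
// one of the common-property groupings described above.
std::vector<Span> merge_adjacent(std::vector<Span> spans) {
    std::sort(spans.begin(), spans.end(),
              [](const Span& a, const Span& b) { return a.addr < b.addr; });
    std::vector<Span> merged;
    for (const Span& s : spans) {
        if (!merged.empty() && merged.back().addr + merged.back().len == s.addr)
            merged.back().len += s.len;  // abuts the previous span: extend it
        else
            merged.push_back(s);         // gap in addresses: start a new span
    }
    return merged;
}
```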

In at least one embodiment, the coalescing agent 104 can determine whether a timer associated with the group of requests expires. In at least one embodiment, the coalescing agent 104 can determine whether a size of the group of requests satisfies a group size criterion. The group size criterion can be a minimum size desired for requests being transported over the network 116. Alternatively, the group size criterion can be specified as a threshold size that needs to be satisfied for coalescing. In at least one embodiment, in response to a determination that the timer expires or the size of the group of requests satisfies the group size criterion, the coalescing agent 104 coalesces the group of requests into a coalesced request 103. In at least one embodiment, the coalescing agent 104 generates a data layout description in a data payload of the coalesced request 103 to indicate how to uncoalesce the coalesced request 103. The communication engine 102 transports the coalesced request 103 to the second GPU 110 over the network 116.
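One plausible encoding of such a data layout description is a small header listing each element's destination address and length, followed by the packed payload bytes. The wire format below is an assumption for illustration, not a format the patent mandates.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One fine-grain put folded into the coalesced request (illustrative type).
struct Element {
    std::uint64_t dst_addr;  // where the receiver should deliver the data
    std::uint32_t len;       // number of payload bytes
    const void* src;         // source of the payload on the sender
};

// Pack elements into one message: [count][addr,len]...[payload bytes]...
std::vector<std::uint8_t> pack(const std::vector<Element>& elems) {
    std::vector<std::uint8_t> msg;
    auto append = [&msg](const void* p, std::size_t n) {
        const auto* b = static_cast<const std::uint8_t*>(p);
        msg.insert(msg.end(), b, b + n);
    };
    const std::uint32_t count = static_cast<std::uint32_t>(elems.size());
    append(&count, sizeof count);  // header: element count
    for (const Element& e : elems) {  // data layout description
        append(&e.dst_addr, sizeof e.dst_addr);
        append(&e.len, sizeof e.len);
    }
    for (const Element& e : elems)  // packed payloads, in order
        append(e.src, e.len);
    return msg;
}
```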

In at least one embodiment, the communication engine 102 (and/or the coalescing agent 104) is implemented as a software communication engine in the GPU 100. In at least one embodiment, the communication engine 102 (and/or the coalescing agent 104) is implemented in a first kernel in the GPU 100, and the shared memory application 106 is executed in a second kernel in the GPU 100. The first kernel can be considered a companion kernel to the second kernel. In at least one embodiment, the communication engine 102 (and/or the coalescing agent 104) is implemented as a software communication engine in a third GPU (not illustrated in FIG. 1) coupled to the GPU 100. In another embodiment, as described in more detail below, the communication engine 102 (and/or the coalescing agent 104) is implemented as hardware logic in a hardware offload circuit coupled to the GPU 100. In another embodiment, as described in more detail below, the communication engine 102 (and/or the coalescing agent 104) is implemented as a software communication engine in a CPU operatively coupled to the GPU 100. Alternatively, the communication engine 102 (and/or the coalescing agent 104) is implemented as software within the GPU 100 executing the shared memory application 106. In at least one embodiment, the coalescing agent 104 is a software-based coalescing agent that aggregates multiple GPU-initiated communication requests 101, such as put, get, and atomic requests, into coalesced requests to boost performance and improve network efficiency. In at least one embodiment, the coalescing agent 104 can be located in a GPU companion kernel to achieve low-overhead software-based coalescing. In at least one embodiment, the coalescing agent 104 is a software-based coalescing agent that executes on a CPU coupled to the GPU 100. The coalescing agent 104 can use a request queue where the communication requests can be placed and analyzed before being handled by the network 116.

As illustrated in FIG. 1, since the GPU 100 sends the coalesced request 103 over the network 116 to the second GPU 110, the second GPU 110 also includes a communication engine 112 with a coalescing agent 114. The communication engine 112 can determine that an incoming request is the coalesced request 103 (e.g., based on the request header) and pass the coalesced request 103 to the coalescing agent 114 to be processed. The coalescing agent 114 can read a data payload of the coalesced request 103. If the request contains a data payload from a put operation that requires uncoalescing (e.g., unpacking), the coalescing agent 114 reads a data layout description from the data payload and delivers the data to a memory (not illustrated in FIG. 1) associated with the second GPU 110, as indicated in the data layout description.
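The receive side reverses the pack() sketch above: it walks the layout description and delivers each element to local memory. As before, the format and the raw-pointer delivery are illustrative assumptions; a real implementation would validate bounds and addresses before writing.

```cpp
#include <cstdint>
#include <cstring>

// Uncoalesce a packed put message: [count][addr,len]...[payload bytes]...
// (the illustrative format produced by the pack() sketch above).
void unpack_and_deliver(const std::uint8_t* msg) {
    std::uint32_t count;
    std::memcpy(&count, msg, sizeof count);
    const std::uint8_t* layout = msg + sizeof count;
    const std::uint8_t* data =
        layout + count * (sizeof(std::uint64_t) + sizeof(std::uint32_t));
    for (std::uint32_t i = 0; i < count; ++i) {
        std::uint64_t dst_addr;
        std::uint32_t len;
        std::memcpy(&dst_addr, layout, sizeof dst_addr);
        layout += sizeof dst_addr;
        std::memcpy(&len, layout, sizeof len);
        layout += sizeof len;
        // Deliver this element's bytes to the local memory it targets.
        std::memcpy(reinterpret_cast<void*>(dst_addr), data, len);
        data += len;
    }
}
```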

FIG. 2 is a block diagram of a communication engine 202 with a coalescing agent 204, in accordance with at least some embodiments. Communication engine 202 can provide a shared memory application 206 with seamless P2P transport (e.g., NVLink or PCIe) and network transport (e.g., InfiniBand or remote direct memory access (RDMA) over Converged Ethernet (RoCE)). The shared memory application 206 can utilize memory 208 (e.g., symmetric memory) and one or more GPU kernels 210. In some cases, the shared memory application 206 is executed on an application kernel in a GPU, and the communication engine 202 is executed or otherwise implemented in a companion kernel in the same GPU. In another embodiment, the communication engine 202 is implemented in a different GPU coupled to the GPU with the shared memory application 206. In another embodiment, the communication engine 202 is implemented on a CPU coupled to the GPU with the shared memory application 206. In another embodiment, the communication engine 202 is implemented in a hardware offload circuit coupled to the GPU with the shared memory application 206.

In at least one embodiment, the shared memory application 206 is an NVSHMEM application, and the communication engine 202 is an NVSHMEM communication engine. In at least one embodiment, the communication engine 202 supports a range of asynchronous read (e.g., NVSHMEM get), write (e.g., NVSHMEM put), and atomic update operations from a single GPU thread, multiple GPU threads, warps, CTAs, or other groups of GPU threads. In at least one embodiment, there can be buffer queues (e.g., first-in-first-out (FIFO) queues) between the shared memory application 206 and the communication engine 202 to decouple the components, as illustrated in FIG. 2.

In at least one embodiment, the communication engine 202 includes a transmit agent 212, which handles outgoing communication requests 201 from the shared memory application 206. In at least one embodiment, the transmit agent 212 includes routing logic 218, a transport engine 220, and the coalescing agent 204. In at least one embodiment, the communication engine 202 includes a receive agent 214, which handles incoming communication requests. In at least one embodiment, the receive agent 214 includes routing logic 222, a transport engine 226, and a coalescing agent 224.

During the operation of the transmit agent 212, the routing logic 218 determines whether a communication request 201 should be coalesced. In at least one embodiment, the routing logic 218 can determine whether the communication request 201 satisfies a coalescing criterion. The routing logic 218 can perform one or more checks of whether the communication request 201 satisfies the coalescing criterion, as described below.

In at least one embodiment, the routing logic 218 can determine that the communication request 201 does not satisfy the coalescing criterion responsive to a determination that the communication request 201 is transportable via a P2P connection 228 with a second GPU 230. In at least one embodiment, if the communication request 201 is to the second GPU 230, which is accessible via the P2P connection 228 (e.g., satisfies a P2P connectivity criterion), the communication request 201 can be forwarded directly to the transport engine 220, which may simply pass it through to direct P2P communication with the second GPU 230 over the P2P connection 228. In at least one embodiment, checks for P2P connectivity can be performed through a combination of static and/or dynamic checking to optimize performance. In at least one embodiment, the routing logic 218 can determine that the communication request 201 does not satisfy the coalescing criterion responsive to a determination that the communication request 201 exceeds a threshold size or otherwise satisfies a size criterion. The routing logic 218, responsive to a determination that the communication request 201 satisfies the size criterion, can forward the communication request 201 directly to the transport engine 220, which may pass the communication request 201 to a third GPU 232 via the network 216. In at least one embodiment, the routing logic 218 can determine that the communication request 201 does not satisfy the coalescing criterion responsive to a determination that the communication request 201 is a latency-sensitive request. In at least one embodiment, the routing logic 218 can determine whether the communication request 201 satisfies a latency criterion. In at least one embodiment, the routing logic 218 can use a heuristic approach to determine whether the communication request 201 satisfies the latency criterion. The routing logic 218, responsive to a determination that the communication request 201 satisfies the latency criterion, can forward the communication request 201 directly to the transport engine 220, which may pass the communication request 201 to the third GPU 232 via the network 216. In at least one embodiment, the routing logic 218 can determine that the communication request 201 originates from a group of threads (e.g., a warp or CTA), and the routing logic 218 can perform group-level coalescing before proceeding. Alternatively, the routing logic 218 can determine that the communication request 201 is a coalesced request on which group-level coalescing has already been performed. The routing logic 218 can determine whether the coalesced request satisfies a coalescing criterion similarly as described above for the communication request 201.

In at least one embodiment, responsive to the routing logic 218 determining that the communication request 201 should be coalesced, the coalescing agent 204 stores the communication request 201 in association with a group of requests that have a common property. In at least one embodiment, the coalescing agent 204 inserts a request entry into an associative data structure 234, where it is appended to a request group of similar requests. In at least one embodiment, similar requests can have one or more common properties, such as the same operation, the same network destination, the same GPU destination, adjacent memory locations, or the like. In at least one embodiment, if a new request group is created, the coalescing agent 204 resets a corresponding timer for the new request group. In at least one embodiment, the timer establishes an upper bound on the latency that the request group incurs in the coalescing agent 204. In at least one embodiment, the coalescing agent 204 can specify a group size criterion for the new request group as well. The group size criterion can establish a desired minimum size for network transports over the network 216. In another embodiment, the group size criterion can set a range of sizes, and the coalesced request is determined to satisfy the group size criterion if its size is within the set range.

In at least one embodiment, the coalescing agent 204 can create and manage one or more associative data structures for one or more groups of requests, respectively. In at least one embodiment, the coalescing agent 204 determines that a timer associated with the group of requests expires or the group's size satisfies a group size criterion. The coalescing agent 204 coalesces the group of requests into a coalesced request 203 and forwards the coalesced request 203 to the transport engine 220. In at least one embodiment, the coalescing agent periodically scans the list of request groups; any groups whose timer has expired are coalesced and forwarded to the transport engine 220. In at least one embodiment, the transport engine 220 sends the coalesced request 203 to the destination (e.g., the third GPU 232). In at least one embodiment, the transport engine 220 can pass P2P communication requests (e.g., P2P transfers) as direct P2P transfers via the P2P connection(s) 228. The P2P transfers can be contiguous or noncontiguous transfers. In at least one embodiment, for contiguous network transfers, the transport engine 220 can transport the communication requests via RDMA operations. In at least one embodiment, for noncontiguous network transfers, the transport engine 220 can write or read a packed data payload to or from the destination. In at least one embodiment, the transport engine 220 can coordinate with the destination to establish a noncontiguous data transfer (e.g., using I/O vector transmission or the User-Mode Memory Registration (UMR) feature of an InfiniBand NIC).

During the operation of the receive agent 214, the transport engine 226 notifies the routing logic 222 that a communication request 205 was received. The routing logic 222 determines whether the communication request 205 should be uncoalesced. In at least one embodiment, the routing logic 222 can determine whether the communication request 205 has a payload directed to contiguous memory or whether an RDMA noncontiguous transfer was performed (e.g., using IOV or UMR capability). For RDMA and P2P transfers, it should be appreciated that the functions of the transport engine 226 and routing logic 222 may be considered to be implemented entirely by the underlying hardware, and the software processing can be bypassed. In another embodiment, the routing logic 222 can determine that the communication request 205 includes a data payload for a write operation (e.g., put) that requires uncoalescing (e.g., unpacking). In this scenario, the coalescing agent 224 reads a data layout description from the data payload and delivers data words to the memory 208 as indicated. In at least one embodiment, if the communication request 205 includes a data payload from an atomic operation that requires uncoalescing (e.g., unpacking), the coalescing agent 224 reads a data layout description from the data payload and performs atomic updates to the memory 208 as indicated. If the atomic operations return data, the coalescing agent 224 generates a packed data response and returns it to the initiator through the transmit agent 212. In at least one embodiment, if the communication request 205 is for a read that requires uncoalescing, the coalescing agent 224 reads a data layout description from the data payload, generates the packed data response, and returns the data through the transmit agent 212. In at least one embodiment, if the communication request 205 is associated with a small data payload, the coalescing agent 224 can send the responses to the transmit agent 212 for coalescing by forming request groups that aggregate responses to the initiators.
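For the atomic path, the receive-side coalescing agent applies each packed update and, when the operations return data, gathers the prior values into a packed response. The sketch below assumes 64-bit fetch-and-add operations and an illustrative descriptor type; the patent does not prescribe this representation.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One packed atomic update from an uncoalesced request (illustrative type).
struct AtomicAdd {
    std::uint64_t dst_addr;  // address of a 64-bit counter in local memory
    std::uint64_t operand;   // value to add
};

// Apply each atomic update and collect the prior values so a packed data
// response can be returned to the initiator through the transmit agent.
std::vector<std::uint64_t> apply_atomics(const std::vector<AtomicAdd>& ops) {
    std::vector<std::uint64_t> prior;
    prior.reserve(ops.size());
    for (const AtomicAdd& op : ops) {
        auto* target =
            reinterpret_cast<std::atomic<std::uint64_t>*>(op.dst_addr);
        prior.push_back(target->fetch_add(op.operand));  // returns old value
    }
    return prior;  // packed and sent back via the transmit agent
}
```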

FIG. 3 is a flow diagram of a method 300 for coalescing a group of communication requests into a coalesced request, in accordance with at least some embodiments. Method 300 can be performed by processing logic comprising hardware, software, firmware, or any combination thereof. In at least one embodiment, the method 300 is performed by the communication engine 102 or the coalescing agent 104 of FIG. 1. In at least one embodiment, the method 300 is performed by the communication engine 202 of FIG. 2. In at least one embodiment, the method 300 is performed by the transmit agent 212 of FIG. 2. In at least one embodiment, the method 300 is performed by the routing logic 218 and coalescing agent 204 of FIG. 2.

Referring to FIG. 3, the method 300 begins by the processing logic receiving, from a shared memory application executing on a first GPU, a first communication request initially assigned to and/or having a second GPU as a destination for processing (block 302). The processing logic determines that the first communication request satisfies a coalescing criterion (block 304). The processing logic stores the first communication request in association with a group of requests with a common property (block 306). In at least one embodiment, the common property is at least one of the same operation type, the same network destination, the same GPU destination, or adjacent memory locations. The processing logic determines that a timer associated with the group of requests expires or the group's size satisfies a group size criterion (block 308). The processing logic coalesces the group of requests into a coalesced request (block 310). The processing logic transports the coalesced request to the second GPU over a network (block 312).

In a further embodiment, the processing logic receives a second communication request. In at least one embodiment, the first communication request originates from a first group of threads of the first GPU, and the second communication request originates from a second group of threads of the first GPU. The processing logic determines that the second communication request satisfies the coalescing criterion and that it has a common property with the first communication request. The processing logic stores the second communication request in association with the group of requests with the common property.

In at least one embodiment, the processing logic receives a second communication request initially assigned to and/or having a third GPU as a destination for processing. The processing logic determines that the second communication request does not satisfy the coalescing criterion. For example, the second communication request is transportable via a P2P connection with the third GPU. In at least one embodiment, the second communication request does not satisfy the coalescing criterion for other reasons. The processing logic transports the second communication request to the third GPU over the P2P connection.

In at least one embodiment, the processing logic determines that the first communication request satisfies the coalescing criterion by determining that the first communication request satisfies at least one of a request size criterion, a latency criterion, or a P2P connectivity criterion.

In at least one embodiment, the processing logic receives a second communication request that originates from a group of threads of the first GPU and is initially assigned to and/or has a third GPU as a destination. The processing logic performs group-level coalescing of the second communication request with other communication requests from the group of threads to obtain a group-level request. The processing logic determines that the group-level request satisfies the coalescing criterion and stores the group-level request in association with a second group of requests with a common property. The processing logic determines that a second timer associated with the second group of requests expires or the second group's size satisfies the group size criterion. The processing logic coalesces the second group of requests into a second coalesced request and transports the second coalesced request to the third GPU. In another embodiment, the processing logic receives a communication request that is already a group-level coalesced request, where another entity has already performed group-level coalescing.

As described herein, in at least one embodiment, the processing logic is implemented as software in the first GPU that executes the shared memory application, as software in a second GPU that is coupled to the first GPU, as software in a CPU coupled to the first GPU, or as hardware, such as in a hardware offload circuit. In at least one embodiment, the processing logic is implemented in a first kernel, a companion kernel, in the first GPU, and the shared memory application is executed in a second kernel in the first GPU. Additional details of different configurations are described below with respect to FIGS. 4-7.

FIG. 4 is a network diagram of a computer system 400 having multiple GPUs coupled via a network 416 and peer-to-peer (P2P) connections 428, the GPUs each including a communication engine with a coalescing agent, in accordance with at least some embodiments. Computer system 400 includes a first GPU 410, a second GPU 420, a third GPU 430, a fourth GPU 440, and a fifth GPU 450. The first GPU 410 includes memory 408 and a communication engine 402, which includes a coalescing agent 404. The communication engine 402 and coalescing agent 404 are similar to the communication engines and coalescing agents described above with respect to FIGS. 1-3. The first GPU 410 executes a shared memory application 406 that can communicate with other GPUs for data transfers to memory distributed at the other GPUs.

It should be noted that although illustrated as part of the GPUs, the memory can be external to a GPU but otherwise associated with the respective GPU. For example, the memory 408 is local to the first GPU 410, whereas memory 418 is local to the second GPU 420. The first GPU 410 can send or receive communication requests to and from the second GPU 420 via a first P2P connection in a P2P fabric or P2P network 422. The first GPU 410 can send or receive communication requests to and from the third GPU 430 via a second P2P connection in the P2P fabric or P2P network 422. The first GPU 410 can send or receive communication requests to and from the fourth GPU 440 via a network 416. The first GPU 410 can send or receive communication requests to and from the fifth GPU 450 via the network 416. As illustrated, the fourth GPU 440 can send or receive communication requests to and from the fifth GPU 450 via a P2P connection in a P2P fabric or P2P network 424. During operation, the communication engine 402 can receive a communication request from the shared memory application 406 and determine whether the communication request should be coalesced. If the communication request should be coalesced, the coalescing agent 404 coalesces the communication request with a group of requests into a coalesced request and sends the coalesced request over the network 416. If the communication request does not need to be coalesced, the communication engine 402 can send the communication request to the destination GPU via the appropriate connection (e.g., P2P network 422 or network 416). Each of the GPUs can include a corresponding memory and communication engine with a coalescing agent. In other embodiments, multiple GPUs can be connected via P2P and network connections in other configurations.

FIG. 5 is a network diagram of a first GPU 510 coupled to a second GPU 520 having a coalescing agent 514, in accordance with at least some embodiments. The first GPU 510 executes a shared memory application 508 that sends or receives communication requests to and from other GPUs. The communication engine 502 may not include a coalescing agent, as it is not connected to a network 516. In at least one embodiment, the communication engine 502 sends communication requests to the communication engine 512 of the second GPU 520 via P2P connection 522, and the communication engine 512 can determine if the communication requests should be coalesced by the coalescing agent 514. The coalescing agent 514 can coalesce the communication requests into a coalesced request and send the coalesced request to a third GPU 530 via the network 516. The third GPU 530 includes memory 538 and communication engine 532, which includes a coalescing agent 534. The communication engine 532 can determine if a received request needs to be uncoalesced and use the coalescing agent 534 to uncoalesce those requests.

FIG. 6 is a network diagram of a first GPU 610 coupled to a CPU 620 having a coalescing agent 614, in accordance with at least some embodiments. The first GPU 610 executes a shared memory application 608 that sends or receives communication requests to and from other GPUs. The communication engine 602 may not include a coalescing agent, as it is not connected to a network 616. In at least one embodiment, the communication engine 602 sends communication requests to the communication engine 612 of the CPU 620 via P2P connection 622, and the communication engine 612 can determine if the communication requests should be coalesced by the coalescing agent 614. The coalescing agent 614 can coalesce the communication requests into a coalesced request and send the coalesced request to a third GPU 630 via the network 616. The third GPU 630 includes memory 638 and communication engine 632, which includes a coalescing agent 634. The communication engine 632 can determine if a received request needs to be uncoalesced and use the coalescing agent 634 to uncoalesce those requests.

FIG. 7 is a network diagram of a first GPU 710 coupled to a hardware offload circuit 720 having a coalescing agent 714, in accordance with at least some embodiments. The first GPU 710 executes a shared memory application 708 that sends or receives communication requests to and from other GPUs. The communication engine 702 includes a hardware communication engine 720 connected to a network 716. In at least one embodiment, the shared memory application 708 sends communication requests to the hardware communication engine 720, and the hardware communication engine 720 can determine if the communication requests should be coalesced by the coalescing agent 714. The coalescing agent 714 can coalesce the communication requests into a coalesced request and send the coalesced request to a third GPU 730 via the network 716. The third GPU 730 includes memory 738 and communication engine 732, which includes a coalescing agent 734. The communication engine 732 can also be a hardware communication engine. The communication engine 732 can determine if a received request needs to be uncoalesced and use the coalescing agent 734 to uncoalesce those requests.

FIG. 8 is a block diagram of a computing device 800 having a GPU with a coalescing agent 104, in accordance with at least some embodiments. In some embodiments, computing device 800 may include engines of a customizable pipeline, including training engine 820, export engine 830, build engine 850, deployment engine 870, and inference engine 880. Although FIG. 8 depicts all engines as part of the same computing device, in some implementations, any of the engines shown may in fact be implemented on different computing devices, including virtual computing devices, cloud-based processing devices, and the like. For example, computing device 800 may include inference engine 880 but not other engines of the customizable pipeline. Inference engine 880 (and/or any other engines of the pipeline) may be executed by one or more GPUs 810 to perform speech recognition, object recognition, or any other inferencing involving machine learning. In some embodiments, a GPU 810 includes multiple cores 811, each core being capable of executing multiple threads 812. Each core may run multiple threads 812 concurrently (e.g., in parallel). In some embodiments, threads 812 may have access to registers 813. Registers 813 may be thread-specific registers, with access to a register restricted to a respective thread. Additionally, shared registers 814 may be accessed by all threads of the core. In some embodiments, each core 811 may include a scheduler 815 to distribute computational tasks and processes among different threads 812 of core 811. A dispatch unit 816 may implement scheduled tasks on appropriate threads using the correct private registers 813 and shared registers 814. Computing device 800 may include input/output component(s) 834 to facilitate exchange of information with one or more users or developers.

In some embodiments, GPU 810 may have a (high-speed) cache 818, access to which may be shared by multiple cores 811. Furthermore, computing device 800 may include a GPU memory 819 where GPU 810 may store intermediate and/or final results (outputs) of various computations performed by GPU 810. After completion of a particular task, GPU 810 (or CPU 830) may move the output to (main) memory 804. In some embodiments, CPU 830 may execute processes that involve serial computational tasks (assigned by one of the engines of the pipeline), whereas GPU 810 may execute tasks (such as multiplication of inputs of a neural node by weights and adding biases) that are amenable to parallel processing. In some embodiments, a respective engine of the pipeline (e.g., build engine 850, inference engine 880, etc.) may determine which processes managed by the respective engine are to be executed on GPU 810 and which processes are to be executed on CPU 830. In some embodiments, CPU 830 may determine which processes are to be executed on GPU 810 and which processes are to be executed on CPU 830.

As illustrated in FIG. 8, the GPU 810 includes a coalescing agent 104. The coalescing agent 104 can be part of a communication engine of the GPU 810. The coalescing agent 104 operates in a similar manner as the communication engines and coalescing agents described above with respect to FIGS. 1-7. In at least one embodiment, the GPU 810 sends and receives communication requests with other GPUs via P2P connections and network connections. For communication requests over the network, the coalescing agent 104 determines whether the communication request satisfies a coalescing criterion. When the communication request satisfies the coalescing criterion, the coalescing agent stores the communication request in association with a group of requests that have a common property. The coalescing agent 104 determines whether a timer associated with the group of requests expires or a size of the group satisfies a group size criterion. The coalescing agent 104 coalesces the group of requests into a coalesced request in response to the timer expiring or the size of the group satisfying the group size criterion. The coalescing agent 104 sends the coalesced request to a second GPU over a network.

Data Center

FIG. 9 illustrates an example data center 900, in which at least one embodiment may be used. In at least one embodiment, data center 900 includes a data center infrastructure layer 910, a framework layer 920, a software layer 930, and an application layer 940.

In at least one embodiment, as shown in FIG. 9, data center infrastructure layer 910 may include a resource orchestrator 912, grouped computing resources 914, and node computing resources (“node C.R.s”) 916(1)-916(N), where “N” represents any whole, positive integer. In at least one embodiment, node C.R.s 916(1)-916(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic random access memory), storage devices (e.g., solid state or disk drives), network input/output (“NW I/O”) devices, network switches, virtual machines (“VMs”), power modules, cooling modules, etc. In at least one embodiment, one or more node C.R.s from among node C.R.s 916(1)-916(N) may be a server having one or more of the above-mentioned computing resources.

In at least one embodiment, grouped computing resources 914 may include separate groupings of node C.R.s housed within one or more racks (not shown), or many racks housed in data centers at various geographical locations (also not shown). Separate groupings of node C.R.s within grouped computing resources 914 may include grouped compute, network, memory, or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s including CPUs or processors may be grouped within one or more racks to provide compute resources to support one or more workloads. In at least one embodiment, one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

In at least one embodiment, resource orchestrator 912 may configure or otherwise control one or more node C.R.s 916(1)-916(N) and/or grouped computing resources 914. In at least one embodiment, resource orchestrator 912 may include a software design infrastructure (“SDI”) management entity for data center 900. In at least one embodiment, the resource orchestrator may include hardware, software, or some combination thereof.

In at least one embodiment, as shown in FIG. 9, framework layer 920 includes a job scheduler 922, a configuration manager 924, a resource manager 926, and a distributed file system 928. In at least one embodiment, framework layer 920 may include a framework to support software 932 of software layer 930 and/or one or more application(s) 942 of application layer 940. In at least one embodiment, software 932 or application(s) 942 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 920 may be, but is not limited to, a type of free and open-source software web application framework such as Apache Spark™ (hereinafter “Spark”) that may utilize distributed file system 928 for large-scale data processing (e.g., “big data”). In at least one embodiment, job scheduler 922 may include a Spark driver to facilitate scheduling of workloads supported by various layers of data center 900. In at least one embodiment, configuration manager 924 may be capable of configuring different layers, such as software layer 930 and framework layer 920, including Spark and distributed file system 928 for supporting large-scale data processing. In at least one embodiment, resource manager 926 may be capable of managing clustered or grouped computing resources mapped to or allocated for support of distributed file system 928 and job scheduler 922. In at least one embodiment, clustered or grouped computing resources may include grouped computing resources 914 at data center infrastructure layer 910. In at least one embodiment, resource manager 926 may coordinate with resource orchestrator 912 to manage these mapped or allocated computing resources.

In at least one embodiment, software 932 included in software layer 930 may include software used by at least portions of node C.R.s 916(1)-916(N), grouped computing resources 914, and/or distributed file system 928 of framework layer 920. The one or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scan software, database software, and streaming video content software.

In at least one embodiment, application(s) 942 included in application layer 940 may include one or more types of applications used by at least portions of node C.R.s 916(1)-916(N), grouped computing resources 914, and/or distributed file system 928 of framework layer 920. One or more types of applications may include, but are not limited to, any number of a genomics application, a cognitive compute application, and a machine learning application, including training or inferencing software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

In at least one embodiment, any of configuration manager 924, resource manager 926, and resource orchestrator 912 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible fashion. In at least one embodiment, self-modifying actions may relieve a data center operator of data center 900 from making possibly bad configuration decisions and may help avoid underutilized and/or poorly performing portions of a data center.

In at least one embodiment, data center 900 may include tools, services, software, or other resources to train one or more machine learning models or predict or infer information using one or more machine learning models according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture using software and computing resources described above with respect to data center 900. In at least one embodiment, trained machine learning models corresponding to one or more neural networks may be used to infer or predict information using resources described above with respect to data center 900 by using weight parameters calculated through one or more training techniques described herein.

In at least one embodiment, the data center may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, or other hardware to perform training and/or inferencing using the above-described resources. Moreover, one or more software and/or hardware resources described above may be configured as a service to allow users to train or perform inferencing of information, such as image recognition, speech recognition, or other artificial intelligence services.

Communication engine and coalescing agent logic 904 is used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in the system of FIG. 9 for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

Such components can be used to generate synthetic data imitating failure cases in a network training process, which can help to improve performance of the network while limiting the amount of synthetic data to avoid overfitting.

Autonomous Vehicle

FIG. 10A illustrates an example of an autonomous vehicle 1000, according to at least one embodiment. In at least one embodiment, autonomous vehicle 1000 (alternatively referred to herein as “vehicle 1000”) may be, without limitation, a passenger vehicle, such as a car, a truck, a bus, and/or another type of vehicle that accommodates one or more passengers. In at least one embodiment, vehicle 1000 may be a semi-tractor-trailer truck used for hauling cargo. In at least one embodiment, vehicle 1000 may be an airplane, robotic vehicle, or other kind of vehicle.

Autonomous vehicles may be described in terms of automation levels, defined by the National Highway Traffic Safety Administration (“NHTSA”), a division of the US Department of Transportation, and the Society of Automotive Engineers (“SAE”) “Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles” (e.g., Standard No. J3016-201806, published on Jun. 15, 2018, Standard No. J3016-201609, published on Sep. 30, 2016, and previous and future versions of this standard). In one or more embodiments, vehicle 1000 may be capable of functionality in accordance with one or more of Level 1 through Level 5 of the autonomous driving levels. For example, in at least one embodiment, vehicle 1000 may be capable of conditional automation (Level 3), high automation (Level 4), and/or full automation (Level 5), depending on the embodiment.

In at least one embodiment, vehicle 1000 may include, without limitation, components such as a chassis, a vehicle body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other components of a vehicle. In at least one embodiment, vehicle 1000 may include, without limitation, a propulsion system 1050, such as an internal combustion engine, a hybrid electric power plant, an all-electric engine, and/or another propulsion system type. In at least one embodiment, propulsion system 1050 may be connected to a drive train of vehicle 1000, which may include, without limitation, a transmission, to enable propulsion of vehicle 1000. In at least one embodiment, propulsion system 1050 may be controlled in response to receiving signals from throttle/accelerator(s) 1052.

In at least one embodiment, a steering system 1054, which may include, without limitation, a steering wheel, is used to steer vehicle 1000 (e.g., along a desired path or route) when propulsion system 1050 is operating (e.g., when the vehicle is in motion). In at least one embodiment, steering system 1054 may receive signals from steering actuator(s) 1056. A steering wheel may be optional for full automation (Level 5) functionality. In at least one embodiment, a brake sensor system 1046 may be used to operate vehicle brakes in response to receiving signals from brake actuator(s) 1048 and/or brake sensors.

In at least one embodiment, controller(s) 1036, which may include, without limitation, one or more system on chips (“SoCs”) (not shown in FIG. 10A) and/or graphics processing unit(s) (“GPU(s)”), provide signals (e.g., representative of commands) to one or more components and/or systems of vehicle 1000. For instance, in at least one embodiment, controller(s) 1036 may send signals to operate vehicle brakes via brake actuator(s) 1048, to operate steering system 1054 via steering actuator(s) 1056, and/or to operate propulsion system 1050 via throttle/accelerator(s) 1052. Controller(s) 1036 may include one or more onboard (e.g., integrated) computing devices (e.g., supercomputers) that process sensor signals and output operation commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a human driver in driving vehicle 1000. In at least one embodiment, controller(s) 1036 may include a first controller 1036 for autonomous driving functions, a second controller 1036 for functional safety functions, a third controller 1036 for artificial intelligence functionality (e.g., computer vision), a fourth controller 1036 for infotainment functionality, a fifth controller 1036 for redundancy in emergency conditions, and/or other controllers. In at least one embodiment, a single controller 1036 may handle two or more of the above functionalities, two or more controllers 1036 may handle a single functionality, and/or any combination thereof.

In at least one embodiment, controller(s) 1036 provide signals for controlling one or more components and/or systems of vehicle 1000 in response to sensor data received from one or more sensors (e.g., sensor inputs). In at least one embodiment, sensor data may be received from, for example and without limitation, global navigation satellite systems (“GNSS”) sensor(s) 1058 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 1060, ultrasonic sensor(s) 1062, LIDAR sensor(s) 1064, inertial measurement unit (“IMU”) sensor(s) 1066 (e.g., accelerometer(s), gyroscope(s), magnetic compass(es), magnetometer(s), etc.), microphone(s) 1096, stereo camera(s) 1068, wide-view camera(s) 1070 (e.g., fisheye cameras), infrared camera(s) 1072, surround camera(s) 1074 (e.g., 360 degree cameras), long-range cameras (not shown in FIG. 10A), mid-range camera(s) (not shown in FIG. 10A), speed sensor(s) 1044 (e.g., for measuring speed of vehicle 1000), vibration sensor(s) 1042, steering sensor(s) 1040, brake sensor(s) (e.g., as part of brake sensor system 1046), and/or other sensor types.

In at least one embodiment, one or more of controller(s) 1036 may receive inputs (e.g., represented by input data) from an instrument cluster 1032 of vehicle 1000 and provide outputs (e.g., represented by output data, display data, etc.) via a human-machine interface (“HMI”) display 1034, an audible annunciator, a loudspeaker, and/or via other components of vehicle 1000. In at least one embodiment, outputs may include information such as vehicle velocity, speed, time, map data (e.g., a High Definition map (not shown in FIG. 10A)), location data (e.g., vehicle 1000's location, such as on a map), direction, location of other vehicles (e.g., an occupancy grid), information about objects and status of objects as perceived by controller(s) 1036, etc. For example, in at least one embodiment, HMI display 1034 may display information about the presence of one or more objects (e.g., a street sign, caution sign, traffic light changing, etc.), and/or information about driving maneuvers the vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).

In at least one embodiment, vehicle 1000 further includes a network interface 1024 which may use wireless antenna(s) 1026 and/or modem(s) to communicate over one or more networks. For example, in at least one embodiment, network interface 1024 may be capable of communication over Long-Term Evolution (“LTE”), Wideband Code Division Multiple Access (“WCDMA”), Universal Mobile Telecommunications System (“UMTS”), Global System for Mobile communication (“GSM”), IMT-CDMA Multi-Carrier (“CDMA2000”), etc. In at least one embodiment, wireless antenna(s) 1026 may also enable communication between objects in the environment (e.g., vehicles, mobile devices, etc.), using local area network(s), such as Bluetooth, Bluetooth Low Energy (“LE”), Z-Wave, ZigBee, etc., and/or low power wide-area network(s) (“LPWANs”), such as LoRaWAN, SigFox, etc.

Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in the system of FIG. 10A for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
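
As a rough, host-side illustration of this idea (and not the patent's actual implementation), the following sketch buffers fine-grain put requests per destination GPU, the common property used for grouping here, and flushes a group as one transfer once enough payload accumulates; the `PutRequest` and `CoalescingAgent` types and the byte threshold are all hypothetical:

```cuda
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Hypothetical fine-grain put request as a GPU thread might issue it.
struct PutRequest {
    int      dst_gpu;   // destination GPU: the common property used for grouping
    uint64_t dst_addr;  // remote address
    uint32_t bytes;     // payload size
};

// Minimal coalescing agent: buffers requests per destination and flushes a
// group as one (simulated) network transfer once enough payload accumulates.
class CoalescingAgent {
public:
    explicit CoalescingAgent(uint32_t flush_bytes) : flush_bytes_(flush_bytes) {}

    void submit(const PutRequest& req) {
        groups_[req.dst_gpu].push_back(req);
        pending_bytes_[req.dst_gpu] += req.bytes;
        if (pending_bytes_[req.dst_gpu] >= flush_bytes_)  // coalescing criterion
            flush(req.dst_gpu);
    }

    void flush(int dst_gpu) {
        auto& group = groups_[dst_gpu];
        if (group.empty()) return;
        // A real engine would build one network descriptor here; we just log.
        printf("coalesced %zu requests (%u bytes) -> GPU %d\n",
               group.size(), pending_bytes_[dst_gpu], dst_gpu);
        group.clear();
        pending_bytes_[dst_gpu] = 0;
    }

private:
    uint32_t flush_bytes_;
    std::map<int, std::vector<PutRequest>> groups_;
    std::map<int, uint32_t> pending_bytes_;
};

int main() {
    CoalescingAgent agent(/*flush_bytes=*/256);
    for (uint64_t i = 0; i < 64; ++i)
        agent.submit({/*dst_gpu=*/1, /*dst_addr=*/0x1000 + 8 * i, /*bytes=*/8});
    agent.flush(1);  // drain anything still below the threshold
    return 0;
}
```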

Such components can be used to generate synthetic data imitating failure cases in a network training process, which can help to improve performance of the network while limiting the amount of synthetic data to avoid overfitting.

FIG. 10B illustrates an example of camera locations and fields of view for autonomous vehicle 1000 of FIG. 10A, according to at least one embodiment. In at least one embodiment, cameras and respective fields of view are one example embodiment and are not intended to be limiting. For instance, in at least one embodiment, additional and/or alternative cameras may be included and/or cameras may be located at different locations on vehicle 1000.

In at least one embodiment, camera types for cameras may include, but are not limited to, digital cameras that may be adapted for use with components and/or systems of vehicle 1000. In at least one embodiment, one or more of camera(s) may operate at automotive safety integrity level (“ASIL”) B and/or at another ASIL. In at least one embodiment, camera types may be capable of any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc., depending on embodiment. In at least one embodiment, cameras may be capable of using rolling shutters, global shutters, another type of shutter, or a combination thereof. In at least one embodiment, a color filter array may include a red clear clear clear (“RCCC”) color filter array, a red clear clear blue (“RCCB”) color filter array, a red blue green clear (“RBGC”) color filter array, a Foveon X3 color filter array, a Bayer sensor (“RGGB”) color filter array, a monochrome sensor color filter array, and/or another type of color filter array. In at least one embodiment, clear pixel cameras, such as cameras with an RCCC, an RCCB, and/or an RBGC color filter array, may be used in an effort to increase light sensitivity.

In at least one embodiment, one or more of camera(s) may be used to perform advanced driver assistance systems (“ADAS”) functions (e.g., as part of a redundant or fail-safe design). For example, in at least one embodiment, a Multi-Function Mono Camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlamp control. In at least one embodiment, one or more of camera(s) (e.g., all of cameras) may record and provide image data (e.g., video) simultaneously.

In at least one embodiment, one or more of cameras may be mounted in a mounting assembly, such as a custom designed (three-dimensional (“3D”) printed) assembly, in order to cut out stray light and reflections from within car (e.g., reflections from dashboard reflected in windshield mirrors) which may interfere with camera's image data capture abilities. With reference to wing-mirror mounting assemblies, in at least one embodiment, wing-mirror assemblies may be custom 3D printed so that camera mounting plate matches shape of wing-mirror. In at least one embodiment, camera(s) may be integrated into wing-mirror. For side-view cameras, camera(s) may also be integrated within four pillars at each corner.

In at least one embodiment, cameras with a field of view that includes portions of environment in front of vehicle 1000 (e.g., front-facing cameras) may be used for surround view, to help identify forward-facing paths and obstacles, as well as to aid in, with help of one or more of controllers 1036 and/or control SoCs, providing information critical to generating an occupancy grid and/or determining preferred vehicle paths. In at least one embodiment, front-facing cameras may be used to perform many of same ADAS functions as LIDAR, including, without limitation, emergency braking, pedestrian detection, and collision avoidance. In at least one embodiment, front-facing cameras may also be used for ADAS functions and systems including, without limitation, Lane Departure Warnings (“LDW”), Autonomous Cruise Control (“ACC”), and/or other functions such as traffic sign recognition.

In at least one embodiment, a variety of cameras may be used in a front-facing configuration, including, for example, a monocular camera platform that includes a CMOS (“complementary metal oxide semiconductor”) color imager. In at least one embodiment, wide-view camera 1070 may be used to perceive objects coming into view from periphery (e.g., pedestrians, crossing traffic, or bicycles). Although only one wide-view camera 1070 is illustrated in FIG. 10B, in other embodiments, there may be any number (including zero) of wide-view camera(s) 1070 on vehicle 1000. In at least one embodiment, any number of long-range camera(s) 1098 (e.g., a long-view stereo camera pair) may be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. In at least one embodiment, long-range camera(s) 1098 may also be used for object detection and classification, as well as basic object tracking.

In at least one embodiment, any number of stereo camera(s) 1068 may also be included in a front-facing configuration. In at least one embodiment, one or more of stereo camera(s) 1068 may include an integrated control unit comprising a scalable processing unit, which may provide programmable logic (“FPGA”) and a multi-core microprocessor with an integrated Controller Area Network (“CAN”) or Ethernet interface on a single chip. In at least one embodiment, such a unit may be used to generate a 3D map of environment of vehicle 1000, including a distance estimate for all points in image. In at least one embodiment, one or more of stereo camera(s) 1068 may include, without limitation, compact stereo vision sensor(s) that may include, without limitation, two camera lenses (one each on left and right) and an image processing chip that may measure distance from vehicle 1000 to target object and use generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo camera(s) 1068 may be used in addition to, or alternatively from, those described herein.
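
For intuition, the distance estimate such a unit produces follows the standard pinhole-stereo relation, depth = focal length × baseline / disparity; the following minimal sketch uses illustrative numbers, not parameters of any particular sensor:

```cuda
#include <cstdio>

// Classic pinhole-stereo relation: depth = focal_length * baseline / disparity.
float stereo_depth_m(float focal_px, float baseline_m, float disparity_px) {
    if (disparity_px <= 0.0f) return -1.0f;  // no valid match
    return focal_px * baseline_m / disparity_px;
}

int main() {
    // e.g., 1000-pixel focal length, 12 cm baseline, 25-pixel disparity -> 4.8 m
    printf("%.2f m\n", stereo_depth_m(1000.0f, 0.12f, 25.0f));
    return 0;
}
```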

In at least one embodiment, cameras with a field of view that includes portions of environment to side of vehicle 1000 (e.g., side-view cameras) may be used for surround view, providing information used to create and update occupancy grid, as well as to generate side impact collision warnings. For example, in at least one embodiment, surround camera(s) 1074 (e.g., four surround cameras 1074 as illustrated in FIG. 10B) could be positioned on vehicle 1000. In at least one embodiment, surround camera(s) 1074 may include, without limitation, any number and combination of wide-view camera(s) 1070, fisheye camera(s), 360 degree camera(s), and/or the like. For instance, in at least one embodiment, four fisheye cameras may be positioned on front, rear, and sides of vehicle 1000. In at least one embodiment, vehicle 1000 may use three surround camera(s) 1074 (e.g., left, right, and rear), and may leverage one or more other camera(s) (e.g., a forward-facing camera) as a fourth surround-view camera.

In at least one embodiment, cameras with a field of view that includes portions of environment to rear of vehicle 1000 (e.g., rear-view cameras) may be used for park assistance, surround view, rear collision warnings, and creating and updating occupancy grid. In at least one embodiment, a wide variety of cameras may be used including, but not limited to, cameras that are also suitable as front-facing camera(s) (e.g., long-range camera(s) 1098 and/or mid-range camera(s) 1076, stereo camera(s) 1068, infrared camera(s) 1072, etc.), as described herein.

Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in the system of FIG. 10B for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

Such components can be used to generate synthetic data imitating failure cases in a network training process, which can help to improve performance of the network while limiting the amount of synthetic data to avoid overfitting.

FIG. 10C is a block diagram illustrating an example system architecture for autonomous vehicle 1000 of FIG. 10A, according to at least one embodiment. In at least one embodiment, each of components, features, and systems of vehicle 1000 in FIG. 10C are illustrated as being connected via a bus 1002. In at least one embodiment, bus 1002 may include, without limitation, a CAN data interface (alternatively referred to herein as a “CAN bus”). In at least one embodiment, a CAN bus may be a network inside vehicle 1000 used to aid in control of various features and functionality of vehicle 1000, such as actuation of brakes, acceleration, braking, steering, windshield wipers, etc. In at least one embodiment, bus 1002 may be configured to have dozens or even hundreds of nodes, each with its own unique identifier (e.g., a CAN ID). In at least one embodiment, bus 1002 may be read to find steering wheel angle, ground speed, engine revolutions per minute (“RPMs”), button positions, and/or other vehicle status indicators. In at least one embodiment, bus 1002 may be a CAN bus that is ASIL B compliant.
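
To make the identifier model concrete, a minimal sketch of reading such values off a bus might look as follows; the CAN IDs and scale factors are hypothetical placeholders, since real encodings come from a vehicle's signal database:

```cuda
#include <cstdint>
#include <cstdio>

// Minimal CAN frame as it might arrive from a controller driver.
struct CanFrame {
    uint32_t id;       // node/message identifier (CAN ID)
    uint8_t  data[8];  // up to 8 payload bytes for classic CAN
};

// Hypothetical IDs and scale factors -- illustrative only.
constexpr uint32_t kSteeringAngleId = 0x25;
constexpr uint32_t kGroundSpeedId   = 0xB4;

void decode(const CanFrame& f) {
    switch (f.id) {
    case kSteeringAngleId: {
        int16_t raw = (int16_t)((f.data[0] << 8) | f.data[1]);
        printf("steering angle: %.1f deg\n", raw * 0.1f);   // assumed 0.1 deg/bit
        break;
    }
    case kGroundSpeedId: {
        uint16_t raw = (uint16_t)((f.data[0] << 8) | f.data[1]);
        printf("ground speed: %.2f km/h\n", raw * 0.01f);   // assumed 0.01 km/h/bit
        break;
    }
    default: break;  // other vehicle status indicators
    }
}

int main() {
    CanFrame f{kGroundSpeedId, {0x13, 0x88}};  // raw 5000 -> 50.00 km/h
    decode(f);
    return 0;
}
```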

In at least one embodiment, in addition to, or alternatively from CAN, FlexRay and/or Ethernet may be used. In at least one embodiment, there may be any number of busses 1002, which may include, without limitation, zero or more CAN busses, zero or more FlexRay busses, zero or more Ethernet busses, and/or zero or more other types of busses using a different protocol. In at least one embodiment, two or more busses 1002 may be used to perform different functions, and/or may be used for redundancy. For example, a first bus 1002 may be used for collision avoidance functionality and a second bus 1002 may be used for actuation control. In at least one embodiment, each bus 1002 may communicate with any of components of vehicle 1000, and two or more busses 1002 may communicate with same components. In at least one embodiment, each of any number of system(s) on chip(s) (“SoC(s)”) 1004, each of controller(s) 1036, and/or each computer within vehicle may have access to same input data (e.g., inputs from sensors of vehicle 1000), and may be connected to a common bus, such as a CAN bus.

In at least one embodiment, vehicle 1000 may include one or more controller(s) 1036, such as those described herein with respect to FIG. 10A. Controller(s) 1036 may be used for a variety of functions. In at least one embodiment, controller(s) 1036 may be coupled to any of various other components and systems of vehicle 1000, and may be used for control of vehicle 1000, artificial intelligence of vehicle 1000, infotainment for vehicle 1000, and/or the like.

In at least one embodiment, vehicle 1000 may include any number of SoCs 1004. Each of SoCs 1004 may include, without limitation, central processing units (“CPU(s)”) 1006, graphics processing units (“GPU(s)”) 1008, processor(s) 1010, cache(s) 1012, accelerator(s) 1014, data store(s) 1016, and/or other components and features not illustrated. In at least one embodiment, SoC(s) 1004 may be used to control vehicle 1000 in a variety of platforms and systems. For example, in at least one embodiment, SoC(s) 1004 may be combined in a system (e.g., system of vehicle 1000) with a High Definition (“HD”) map 1022 which may obtain map refreshes and/or updates via network interface 1024 from one or more servers (not shown in FIG. 10C).

In at least one embodiment, CPU(s) 1006 may include a CPU cluster or CPU complex (alternatively referred to herein as a “CCPLEX”). In at least one embodiment, CPU(s) 1006 may include multiple cores and/or level two (“L2”) caches. For instance, in at least one embodiment, CPU(s) 1006 may include eight cores in a coherent multi-processor configuration. In at least one embodiment, CPU(s) 1006 may include four dual-core clusters where each cluster has a dedicated L2 cache (e.g., a 2 MB L2 cache). In at least one embodiment, CPU(s) 1006 (e.g., CCPLEX) may be configured to support simultaneous cluster operation enabling any combination of clusters of CPU(s) 1006 to be active at any given time.

In at least one embodiment, one or more of CPU(s) 1006 may implement power management capabilities that include, without limitation, one or more of following features: individual hardware blocks may be clock-gated automatically when idle to save dynamic power; each core clock may be gated when core is not actively executing instructions due to execution of Wait for Interrupt (“WFI”)/Wait for Event (“WFE”) instructions; each core may be independently power-gated; each core cluster may be independently clock-gated when all cores are clock-gated or power-gated; and/or each core cluster may be independently power-gated when all cores are power-gated. In at least one embodiment, CPU(s) 1006 may further implement an enhanced algorithm for managing power states, where allowed power states and expected wakeup times are specified, and hardware/microcode determines best power state to enter for core, cluster, and CCPLEX. In at least one embodiment, processing cores may support simplified power state entry sequences in software with work offloaded to microcode.
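
A toy model of that last step, picking a power state given allowed states and expected wakeup times, might look like the following; the states, latencies, and power figures are illustrative only:

```cuda
#include <cstdio>

// A power state with its expected wakeup latency; deeper states save more
// power but take longer to wake. All values are invented for illustration.
struct PowerState { const char* name; float wakeup_us; float power_mw; };

// Pick the lowest-power allowed state whose wakeup time fits the budget,
// mirroring the "allowed states + expected wakeup times" contract above.
const PowerState* pick_state(const PowerState* allowed, int n, float budget_us) {
    const PowerState* best = nullptr;
    for (int i = 0; i < n; ++i)
        if (allowed[i].wakeup_us <= budget_us &&
            (!best || allowed[i].power_mw < best->power_mw))
            best = &allowed[i];
    return best;
}

int main() {
    PowerState states[] = {{"active", 0.0f, 500.0f},
                           {"clock-gated", 5.0f, 120.0f},
                           {"power-gated", 80.0f, 10.0f}};
    const PowerState* s = pick_state(states, 3, 20.0f);  // 20 us wakeup budget
    printf("chosen: %s\n", s ? s->name : "none");        // -> clock-gated
    return 0;
}
```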

In at least one embodiment, GPU(s) 1008 may include an integrated GPU (alternatively referred to herein as an “iGPU”). In at least one embodiment, GPU(s) 1008 may be programmable and may be efficient for parallel workloads. In at least one embodiment, GPU(s) 1008 may use an enhanced tensor instruction set. In at least one embodiment, GPU(s) 1008 may include one or more streaming microprocessors, where each streaming microprocessor may include a level one (“L1”) cache (e.g., an L1 cache with at least 96 KB storage capacity), and two or more of streaming microprocessors may share an L2 cache (e.g., an L2 cache with a 512 KB storage capacity). In at least one embodiment, GPU(s) 1008 may include at least eight streaming microprocessors. In at least one embodiment, GPU(s) 1008 may use compute application programming interface(s) (API(s)). In at least one embodiment, GPU(s) 1008 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's CUDA).
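
As a minimal example of the parallel programming model mentioned above, the following CUDA kernel distributes one element per thread across the streaming microprocessors; array initialization is elided:

```cuda
#include <cuda_runtime.h>

// Minimal CUDA kernel: each thread handles one element, so the work
// spreads across the GPU's streaming microprocessors.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // ... fill x and y (omitted) ...
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // 256 threads per block
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```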

In at least one embodiment, one or more of GPU(s) 1008 may be power-optimized for best performance in automotive and embedded use cases. For example, in at least one embodiment, GPU(s) 1008 could be fabricated using a Fin field-effect transistor (“FinFET”) process. In at least one embodiment, each streaming microprocessor may incorporate a number of mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores could be partitioned into four processing blocks. In at least one embodiment, each processing block could be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA TENSOR COREs for deep learning matrix arithmetic, a level zero (“L0”) instruction cache, a warp scheduler, a dispatch unit, and/or a 64 KB register file. In at least one embodiment, streaming microprocessors may include independent parallel integer and floating-point data paths to provide for efficient execution of workloads with a mix of computation and addressing calculations. In at least one embodiment, streaming microprocessors may include independent thread scheduling capability to enable finer-grain synchronization and cooperation between parallel threads. In at least one embodiment, streaming microprocessors may include a combined L1 data cache and shared memory unit in order to improve performance while simplifying programming.

In at least one embodiment, one or more of GPU(s) 1008 may include a high bandwidth memory (“HBM”) and/or a 10 GB HBM2 memory subsystem to provide, in some examples, about 900 GB/second peak memory bandwidth. In at least one embodiment, in addition to, or alternatively from, HBM memory, a synchronous graphics random-access memory (“SGRAM”) may be used, such as a graphics double data rate type five synchronous random-access memory (“GDDR5”).

In at least one embodiment, GPU(s) 1008 may include unified memory technology. In at least one embodiment, address translation services (“ATS”) support may be used to allow GPU(s) 1008 to access CPU(s) 1006 page tables directly. In at least one embodiment, when GPU(s) 1008 memory management unit (“MMU”) experiences a miss, an address translation request may be transmitted to CPU(s) 1006. In response, CPU(s) 1006 may look in its page tables for virtual-to-physical mapping for address and transmit translation back to GPU(s) 1008, in at least one embodiment. In at least one embodiment, unified memory technology may allow a single unified virtual address space for memory of both CPU(s) 1006 and GPU(s) 1008, thereby simplifying GPU(s) 1008 programming and porting of applications to GPU(s) 1008.

In at least one embodiment, GPU(s) 1008 may include any number of access counters that may keep track of frequency of access of GPU(s) 1008 to memory of other processors. In at least one embodiment, access counter(s) may help ensure that memory pages are moved to physical memory of processor that is accessing pages most frequently, thereby improving efficiency for memory ranges shared between processors.
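
A short sketch of how unified memory looks to a programmer, with explicit CUDA hints standing in for what access counters do automatically, might read as follows (the allocation size and device index are arbitrary):

```cuda
#include <cuda_runtime.h>

// One pointer is valid on both CPU and GPU; hints approximate what access
// counters do automatically -- keeping pages near the processor using them most.
int main() {
    const size_t bytes = 1 << 24;
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);           // single unified virtual address

    for (size_t i = 0; i < bytes / sizeof(float); ++i)
        data[i] = 1.0f;                        // CPU touches the pages first

    int gpu = 0;
    // Hint the preferred physical location, then migrate pages ahead of use.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, gpu);
    cudaMemPrefetchAsync(data, bytes, gpu, /*stream=*/0);

    // ... launch kernels that read/write data here ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```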

In at least one embodiment, one or more of SoC(s) 1004 may include any number of cache(s) 1012, including those described herein. For example, in at least one embodiment, cache(s) 1012 could include a level three (“L3”) cache that is available to both CPU(s) 1006 and GPU(s) 1008 (e.g., that is connected to both CPU(s) 1006 and GPU(s) 1008). In at least one embodiment, cache(s) 1012 may include a write-back cache that may keep track of states of lines, such as by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). In at least one embodiment, L3 cache may include 4 MB or more, depending on embodiment, although smaller cache sizes may be used.

In at least one embodiment, one or more of SoC(s) 1004 may include one or more accelerator(s) 1014 (e.g., hardware accelerators, software accelerators, or a combination thereof). In at least one embodiment, SoC(s) 1004 may include a hardware acceleration cluster that may include optimized hardware accelerators and/or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4 MB of SRAM) may enable hardware acceleration cluster to accelerate neural networks and other calculations. In at least one embodiment, hardware acceleration cluster may be used to complement GPU(s) 1008 and to off-load some of tasks of GPU(s) 1008 (e.g., to free up more cycles of GPU(s) 1008 for performing other tasks). In at least one embodiment, accelerator(s) 1014 could be used for targeted workloads (e.g., perception, convolutional neural networks (“CNNs”), recurrent neural networks (“RNNs”), etc.) that are stable enough to be amenable to acceleration. In at least one embodiment, a CNN may include region-based or regional convolutional neural networks (“RCNNs”) and Fast RCNNs (e.g., as used for object detection) or another type of CNN.

In at least one embodiment, accelerator(s) 1014 (e.g., hardware acceleration cluster) may include a deep learning accelerator(s) (“DLA(s)”). DLA(s) may include, without limitation, one or more Tensor processing units (“TPU(s)”) that may be configured to provide an additional ten trillion operations per second for deep learning applications and inferencing. In at least one embodiment, TPU(s) may be accelerators configured to, and optimized for, performing image processing functions (e.g., for CNNs, RCNNs, etc.). DLA(s) may further be optimized for a specific set of neural network types and floating point operations, as well as inferencing. In at least one embodiment, design of DLA(s) may provide more performance per millimeter than a typical general-purpose GPU, and typically vastly exceeds performance of a CPU. In at least one embodiment, TPU(s) may perform several functions, including a single-instance convolution function, supporting, for example, INT8, INT16, and FP16 data types for both features and weights, as well as post-processor functions. In at least one embodiment, DLA(s) may quickly and efficiently execute neural networks, especially CNNs, on processed or unprocessed data for any of a variety of functions, including, for example and without limitation: a CNN for object identification and detection using data from camera sensors; a CNN for distance estimation using data from camera sensors; a CNN for emergency vehicle detection and identification using data from microphones 1096; a CNN for facial recognition and vehicle owner identification using data from camera sensors; and/or a CNN for security and/or safety related events.

In at least one embodiment, DLA(s) may perform any function of GPU(s) 1008, and by using an inference accelerator, for example, a designer may target either DLA(s) or GPU(s) 1008 for any function. For example, in at least one embodiment, designer may focus processing of CNNs and floating point operations on DLA(s) and leave other functions to GPU(s) 1008 and/or other accelerator(s) 1014.

In at least one embodiment, accelerator(s) 1014 (e.g., hardware acceleration cluster) may include a programmable vision accelerator(s) (“PVA”), which may alternatively be referred to herein as a computer vision accelerator. In at least one embodiment, PVA(s) may be designed and configured to accelerate computer vision algorithms for advanced driver assistance system (“ADAS”) 1038, autonomous driving, augmented reality (“AR”) applications, and/or virtual reality (“VR”) applications. PVA(s) may provide a balance between performance and flexibility. For example, in at least one embodiment, each PVA(s) may include, for example and without limitation, any number of reduced instruction set computer (“RISC”) cores, direct memory access (“DMA”), and/or any number of vector processors.

In at least one embodiment, RISC cores may interact with image sensors (e.g., image sensors of any of cameras described herein), image signal processor(s), and/or the like. In at least one embodiment, each of RISC cores may include any amount of memory. In at least one embodiment, RISC cores may use any of a number of protocols, depending on embodiment. In at least one embodiment, RISC cores may execute a real-time operating system (“RTOS”). In at least one embodiment, RISC cores may be implemented using one or more integrated circuit devices, application specific integrated circuits (“ASICs”), and/or memory devices. For example, in at least one embodiment, RISC cores could include an instruction cache and/or a tightly coupled RAM.

In at least one embodiment, DMA may enable components of PVA(s) to access system memory independently of CPU(s) 1006. In at least one embodiment, DMA may support any number of features used to provide optimization to PVA including, but not limited to, supporting multi-dimensional addressing and/or circular addressing. In at least one embodiment, DMA may support up to six or more dimensions of addressing, which may include, without limitation, block width, block height, block depth, horizontal block stepping, vertical block stepping, and/or depth stepping.

In at least one embodiment, vector processors may be programmable processors that may be designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In at least one embodiment, PVA may include a PVA core and two vector processing subsystem partitions. In at least one embodiment, PVA core may include a processor subsystem, DMA engine(s) (e.g., two DMA engines), and/or other peripherals. In at least one embodiment, vector processing subsystem may operate as primary processing engine of PVA, and may include a vector processing unit (“VPU”), an instruction cache, and/or vector memory (e.g., “VMEM”). In at least one embodiment, VPU may include a digital signal processor such as, for example, a single instruction, multiple data (“SIMD”), very long instruction word (“VLIW”) digital signal processor. In at least one embodiment, a combination of SIMD and VLIW may enhance throughput and speed.

In at least one embodiment, each of vector processors may include an instruction cache and may be coupled to dedicated memory. As a result, in at least one embodiment, each of vector processors may be configured to execute independently of other vector processors. In at least one embodiment, vector processors that are included in a particular PVA may be configured to employ data parallelism. For instance, in at least one embodiment, plurality of vector processors included in a single PVA may execute same computer vision algorithm, but on different regions of an image. In at least one embodiment, vector processors included in a particular PVA may simultaneously execute different computer vision algorithms on same image, or even execute different algorithms on sequential images or portions of an image. In at least one embodiment, among other things, any number of PVAs may be included in hardware acceleration cluster and any number of vector processors may be included in each of PVAs. In at least one embodiment, PVA(s) may include additional error correcting code (“ECC”) memory, to enhance overall system safety.

In at least one embodiment, accelerator(s) 1014 (e.g., hardware acceleration cluster) may include a computer vision network on-chip and static random-access memory (“SRAM”), for providing a high-bandwidth, low latency SRAM for accelerator(s) 1014. In at least one embodiment, on-chip memory may include at least 4 MB SRAM, consisting of, for example and without limitation, eight field-configurable memory blocks that may be accessible by both PVA and DLA. In at least one embodiment, each pair of memory blocks may include an advanced peripheral bus (“APB”) interface, configuration circuitry, a controller, and a multiplexer. In at least one embodiment, any type of memory may be used. In at least one embodiment, PVA and DLA may access memory via a backbone that provides PVA and DLA with high-speed access to memory. In at least one embodiment, backbone may include a computer vision network on-chip that interconnects PVA and DLA to memory (e.g., using APB).

In at least one embodiment, computer vision network on-chip may include an interface that determines, before transmission of any control signal/address/data, that both PVA and DLA provide ready and valid signals. In at least one embodiment, an interface may provide for separate phases and separate channels for transmitting control signals/addresses/data, as well as burst-type communications for continuous data transfer. In at least one embodiment, an interface may comply with International Organization for Standardization (“ISO”) 26262 or International Electrotechnical Commission (“IEC”) 61508 standards, although other standards and protocols may be used.

In at least one embodiment, one or more of SoC(s) 1004 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine positions and extents of objects (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and/or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison to LIDAR data for purposes of localization and/or other functions, and/or for other uses.

In at least one embodiment, accelerator(s) 1014 (e.g., hardware accelerator cluster) have a wide array of uses for autonomous driving. In at least one embodiment, PVA may be a programmable vision accelerator that may be used for key processing stages in ADAS and autonomous vehicles. In at least one embodiment, PVA's capabilities are a good match for algorithmic domains needing predictable processing, at low power and low latency. In other words, PVA performs well on semi-dense or dense regular computation, even on small data sets, which need predictable run-times with low latency and low power. In at least one embodiment, in autonomous vehicles, such as vehicle 1000, PVAs are designed to run classic computer vision algorithms, as they are efficient at object detection and operating on integer math.

For example, according to at least one embodiment of the technology, PVA is used to perform computer stereo vision. In at least one embodiment, a semi-global matching-based algorithm may be used in some examples, although this is not intended to be limiting. In at least one embodiment, applications for Level 3-5 autonomous driving use motion estimation/stereo matching on-the-fly (e.g., structure from motion, pedestrian recognition, lane detection, etc.). In at least one embodiment, PVA may perform computer stereo vision function on inputs from two monocular cameras.
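
For intuition only, the following toy matcher shows the core search that stereo matching performs, using a simple sum-of-absolute-differences cost rather than the semi-global matching mentioned above:

```cuda
#include <cassert>
#include <cstdlib>

// Toy SAD block matcher -- far simpler than semi-global matching, but it
// shows the per-pixel disparity search at the heart of computer stereo vision.
// Images are grayscale, row-major; the caller picks (x, y) so the window fits.
int best_disparity(const unsigned char* left, const unsigned char* right,
                   int width, int height, int x, int y,
                   int max_disp, int R) {
    assert(x >= R && x + R < width && y >= R && y + R < height);
    int  best_d = 0;
    long best_cost = -1;
    for (int d = 0; d <= max_disp && x - d - R >= 0; ++d) {
        long cost = 0;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx) {
                int l = left [(y + dy) * width + (x + dx)];
                int r = right[(y + dy) * width + (x + dx - d)];
                cost += labs((long)(l - r));  // sum of absolute differences
            }
        if (best_cost < 0 || cost < best_cost) { best_cost = cost; best_d = d; }
    }
    return best_d;  // larger disparity -> closer object
}
```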

In at least one embodiment, PVA may be used to perform dense optical flow. For example, in at least one embodiment, PVA could process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In at least one embodiment, PVA is used for time of flight depth processing, by processing raw time of flight data to provide processed time of flight data, for example.

In at least one embodiment, DLA may be used to run any type of network to enhance control and driving safety, including for example and without limitation, a neural network that outputs a measure of confidence for each object detection. In at least one embodiment, confidence may be represented or interpreted as a probability, or as providing a relative “weight” of each detection compared to other detections. In at least one embodiment, confidence enables a system to make further decisions regarding which detections should be considered as true positive detections rather than false positive detections. For example, in at least one embodiment, a system may set a threshold value for confidence and consider only detections exceeding threshold value as true positive detections. In an embodiment in which an automatic emergency braking (“AEB”) system is used, false positive detections would cause vehicle to automatically perform emergency braking, which is obviously undesirable. In at least one embodiment, highly confident detections may be considered as triggers for AEB. In at least one embodiment, DLA may run a neural network for regressing confidence value. In at least one embodiment, neural network may take as its input at least some subset of parameters, such as bounding box dimensions, ground plane estimate obtained (e.g., from another subsystem), output from IMU sensor(s) 1066 that correlates with vehicle 1000 orientation, distance, 3D location estimates of object obtained from neural network and/or other sensors (e.g., LIDAR sensor(s) 1064 or RADAR sensor(s) 1060), among others.
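
The thresholding step described here reduces to a simple gate over regressed confidences; in this sketch the detection structure and the 0.9 threshold are illustrative, not values from any deployed system:

```cuda
#include <vector>

// Detections carry a regressed confidence; gating on a threshold keeps only
// those treated as true positives (e.g., as triggers for AEB).
struct Detection { float x, y, w, h; float confidence; };

std::vector<Detection> gate(const std::vector<Detection>& all, float threshold) {
    std::vector<Detection> kept;
    for (const Detection& d : all)
        if (d.confidence > threshold)  // only high-confidence detections pass
            kept.push_back(d);
    return kept;
}

// Usage: auto triggers = gate(detections, 0.9f);  // 0.9f is illustrative
```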

In at least one embodiment, one or more of SoC(s) 1004 may include data store(s) 1016 (e.g., memory). In at least one embodiment, data store(s) 1016 may be on-chip memory of SoC(s) 1004, which may store neural networks to be executed on GPU(s) 1008 and/or DLA. In at least one embodiment, data store(s) 1016 may be large enough in capacity to store multiple instances of neural networks for redundancy and safety. In at least one embodiment, data store(s) 1016 may comprise L2 or L3 cache(s).

In at least one embodiment, one or more of SoC(s) 1004 may include any number of processor(s) 1010 (e.g., embedded processors). In at least one embodiment, processor(s) 1010 may include a boot and power management processor that may be a dedicated processor and subsystem to handle boot power and management functions and related security enforcement. In at least one embodiment, boot and power management processor may be a part of SoC(s) 1004 boot sequence and may provide runtime power management services. In at least one embodiment, boot power and management processor may provide clock and voltage programming, assistance in system low power state transitions, management of SoC(s) 1004 thermals and temperature sensors, and/or management of SoC(s) 1004 power states. In at least one embodiment, each temperature sensor may be implemented as a ring-oscillator whose output frequency is proportional to temperature, and SoC(s) 1004 may use ring-oscillators to detect temperatures of CPU(s) 1006, GPU(s) 1008, and/or accelerator(s) 1014. In at least one embodiment, if temperatures are determined to exceed a threshold, then boot and power management processor may enter a temperature fault routine and put SoC(s) 1004 into a lower power state and/or put vehicle 1000 into a chauffeur to safe stop mode (e.g., bring vehicle 1000 to a safe stop).

In at least one embodiment, processor(s) 1010 may further include a set of embedded processors that may serve as an audio processing engine. In at least one embodiment, audio processing engine may be an audio subsystem that enables full hardware support for multi-channel audio over multiple interfaces, and a broad and flexible range of audio I/O interfaces. In at least one embodiment, audio processing engine is a dedicated processor core with a digital signal processor with dedicated RAM.

In at least one embodiment, processor(s) 1010 may further include an always on processor engine that may provide necessary hardware features to support low power sensor management and wake use cases. In at least one embodiment, always on processor engine may include, without limitation, a processor core, a tightly coupled RAM, supporting peripherals (e.g., timers and interrupt controllers), various I/O controller peripherals, and routing logic.

In at least one embodiment, processor(s) 1010 may further include a safety cluster engine that includes, without limitation, a dedicated processor subsystem to handle safety management for automotive applications. In at least one embodiment, safety cluster engine may include, without limitation, two or more processor cores, a tightly coupled RAM, support peripherals (e.g., timers, an interrupt controller, etc.), and/or routing logic. In a safety mode, two or more cores may operate, in at least one embodiment, in a lockstep mode and function as a single core with comparison logic to detect any differences between their operations. In at least one embodiment, processor(s) 1010 may further include a real-time camera engine that may include, without limitation, a dedicated processor subsystem for handling real-time camera management. In at least one embodiment, processor(s) 1010 may further include a high-dynamic range signal processor that may include, without limitation, an image signal processor that is a hardware engine that is part of camera processing pipeline.

In at least one embodiment, processor(s) 1010 may include a video image compositor that may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions needed by a video playback application to produce final image for player window. In at least one embodiment, video image compositor may perform lens distortion correction on wide-view camera(s) 1070, surround camera(s) 1074, and/or on in-cabin monitoring camera sensor(s). In at least one embodiment, in-cabin monitoring camera sensor(s) are preferably monitored by a neural network running on another instance of SoC(s) 1004, configured to identify in-cabin events and respond accordingly. In at least one embodiment, an in-cabin system may perform, without limitation, lip reading to activate cellular service and place a phone call, dictate emails, change vehicle's destination, activate or change vehicle's infotainment system and settings, or provide voice-activated web surfing. In at least one embodiment, certain functions are available to driver when vehicle is operating in an autonomous mode and are disabled otherwise.

In at least one embodiment, video image compositor may include enhanced temporal noise reduction for both spatial and temporal noise reduction. For example, in at least one embodiment, where motion occurs in a video, noise reduction weights spatial information appropriately, decreasing weight of information provided by adjacent frames. In at least one embodiment, where an image or portion of an image does not include motion, temporal noise reduction performed by video image compositor may use information from previous image to reduce noise in current image.
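
One way to picture this motion-adaptive weighting, with an invented blending curve rather than the compositor's actual filter, is a per-pixel blend like the following:

```cuda
// Motion-adaptive temporal filter for one pixel: with little motion, lean on
// the previous frame (temporal smoothing); with strong motion, trust the
// current frame (spatial information) to avoid ghosting. Weights are invented.
float denoise_pixel(float current, float previous, float motion /* 0..1 */) {
    float temporal_weight = (1.0f - motion) * 0.8f;  // fades out as motion rises
    return temporal_weight * previous + (1.0f - temporal_weight) * current;
}

// Usage: out = denoise_pixel(cur, prev, motion_estimate);
```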

In at least one embodiment, video image compositor may also be configured to perform stereo rectification on input stereo lens frames. In at least one embodiment, video image compositor may further be used for user interface composition when operating system desktop is in use, and GPU(s) 1008 are not required to continuously render new surfaces. In at least one embodiment, when GPU(s) 1008 are powered on and actively doing 3D rendering, video image compositor may be used to offload GPU(s) 1008 to improve performance and responsiveness.

In at least one embodiment, one or more of SoC(s) 1004 may further include a mobile industry processor interface (“MIPI”) camera serial interface for receiving video and input from cameras, a high-speed interface, and/or a video input block that may be used for camera and related pixel input functions. In at least one embodiment, one or more of SoC(s) 1004 may further include an input/output controller(s) that may be controlled by software and may be used for receiving I/O signals that are uncommitted to a specific role.

In at least one embodiment, one or more of SoC(s) 1004 may further include a broad range of peripheral interfaces to enable communication with peripherals, audio encoders/decoders (“codecs”), power management, and/or other devices. SoC(s) 1004 may be used to process data from cameras (e.g., connected over Gigabit Multimedia Serial Link and Ethernet), sensors (e.g., LIDAR sensor(s) 1064, RADAR sensor(s) 1060, etc. that may be connected over Ethernet), data from bus 1002 (e.g., speed of vehicle 1000, steering wheel position, etc.), data from GNSS sensor(s) 1058 (e.g., connected over Ethernet or CAN bus), etc. In at least one embodiment, one or more of SoC(s) 1004 may further include dedicated high-performance mass storage controllers that may include their own DMA engines, and that may be used to free CPU(s) 1006 from routine data management tasks.

In at least one embodiment, SoC(s) 1004 may be an end-to-end platform with a flexible architecture that spans automation levels 3-5, thereby providing a comprehensive functional safety architecture that leverages and makes efficient use of computer vision and ADAS techniques for diversity and redundancy, and provides a platform for a flexible, reliable driving software stack, along with deep learning tools. In at least one embodiment, SoC(s) 1004 may be faster, more reliable, and even more energy-efficient and space-efficient than conventional systems. For example, in at least one embodiment, accelerator(s) 1014, when combined with CPU(s) 1006, GPU(s) 1008, and data store(s) 1016, may provide for a fast, efficient platform for level 3-5 autonomous vehicles.

In at least one embodiment, computer vision algorithms may be executed on CPUs, which may be configured using a high-level programming language, such as C programming language, to execute a wide variety of processing algorithms across a wide variety of visual data. However, in at least one embodiment, CPUs are oftentimes unable to meet performance requirements of many computer vision applications, such as those related to execution time and power consumption, for example. In at least one embodiment, many CPUs are unable to execute complex object detection algorithms in real-time, as required by in-vehicle ADAS applications and by practical Level 3-5 autonomous vehicles.

Embodiments described herein allow for multiple neural networks to be performed simultaneously and/or sequentially, and for results to be combined together to enable Level 3-5 autonomous driving functionality. For example, in at least one embodiment, a CNN executing on DLA or discrete GPU (e.g., GPU(s) 1020) may include text and word recognition, allowing supercomputer to read and understand traffic signs, including signs for which neural network has not been specifically trained. In at least one embodiment, DLA may further include a neural network that is able to identify, interpret, and provide semantic understanding of sign, and to pass that semantic understanding to path planning modules running on CPU Complex.

In at least one embodiment, multiple neural networks may be run simultaneously, as for Level 3, 4, or 5 driving. For example, in at least one embodiment, a warning sign consisting of “Caution: flashing lights indicate icy conditions,” along with an electric light, may be independently or collectively interpreted by several neural networks. In at least one embodiment, a sign itself may be identified as a traffic sign by a first deployed neural network (e.g., a neural network that has been trained), and a text “flashing lights indicate icy conditions” may be interpreted by a second deployed neural network, which informs vehicle's path planning software (preferably executing on CPU Complex) that when flashing lights are detected, icy conditions exist. In at least one embodiment, a flashing light may be identified by operating a third deployed neural network over multiple frames, informing vehicle's path-planning software of presence (or absence) of flashing lights. In at least one embodiment, all three neural networks may run simultaneously, such as within DLA and/or on GPU(s) 1008.

In at least one embodiment, a CNN for facial recognition and vehicle owner identification may use data from camera sensors to identify presence of an authorized driver and/or owner of vehicle 1000. In at least one embodiment, an always on sensor processing engine may be used to unlock vehicle when owner approaches driver door and turn on lights, and, in security mode, to disable vehicle when owner leaves vehicle. In this way, SoC(s) 1004 provide for security against theft and/or carjacking.

In at least one embodiment, a CNN for emergency vehicle detection and identification may use data from microphones 1096 to detect and identify emergency vehicle sirens. In at least one embodiment, SoC(s) 1004 use CNN for classifying environmental and urban sounds, as well as classifying visual data. In at least one embodiment, CNN running on DLA is trained to identify relative closing speed of emergency vehicle (e.g., by using Doppler effect). In at least one embodiment, CNN may also be trained to identify emergency vehicles specific to local area in which vehicle is operating, as identified by GNSS sensor(s) 1058. In at least one embodiment, when operating in Europe, CNN will seek to detect European sirens, and when in United States, CNN will seek to identify only North American sirens. In at least one embodiment, once an emergency vehicle is detected, a control program may be used to execute an emergency vehicle safety routine, slowing vehicle, pulling over to side of road, parking vehicle, and/or idling vehicle, with assistance of ultrasonic sensor(s) 1062, until emergency vehicle(s) passes.
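
The closing-speed estimate via the Doppler effect follows directly from the classic relation for an approaching source; the frequencies below are illustrative, and in practice a network would estimate them from audio:

```cuda
#include <cstdio>

// Doppler relation for a siren approaching a (quasi-)stationary microphone:
// f_obs = f_src * c / (c - v), so v = c * (f_obs - f_src) / f_obs.
float closing_speed_mps(float f_src_hz, float f_obs_hz, float c_mps = 343.0f) {
    return c_mps * (f_obs_hz - f_src_hz) / f_obs_hz;
}

int main() {
    // A 700 Hz siren heard at 720 Hz implies roughly 9.5 m/s closing speed.
    printf("%.1f m/s\n", closing_speed_mps(700.0f, 720.0f));
    return 0;
}
```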

In at least one embodiment, vehicle 1000 may include CPU(s) 1018 (e.g., discrete CPU(s), or dCPU(s)), that may be coupled to SoC(s) 1004 via a high-speed interconnect (e.g., PCIe). In at least one embodiment, CPU(s) 1018 may include an X86 processor, for example. CPU(s) 1018 may be used to perform any of a variety of functions, including arbitrating potentially inconsistent results between ADAS sensors and SoC(s) 1004, and/or monitoring status and health of controller(s) 1036 and/or an infotainment system on a chip (“infotainment SoC”) 1030, for example.

In at least one embodiment, vehicle 1000 may include GPU(s) 1020 (e.g., discrete GPU(s), or dGPU(s)), that may be coupled to SoC(s) 1004 via a high-speed interconnect (e.g., NVIDIA's NVLINK). In at least one embodiment, GPU(s) 1020 may provide additional artificial intelligence functionality, such as by executing redundant and/or different neural networks, and may be used to train and/or update neural networks based at least in part on input (e.g., sensor data) from sensors of vehicle 1000.

In at least one embodiment, vehicle 1000 may further include network interface 1024 which may include, without limitation, wireless antenna(s) 1026 (e.g., one or more wireless antennas 1026 for different communication protocols, such as a cellular antenna, a Bluetooth antenna, etc.). In at least one embodiment, network interface 1024 may be used to enable wireless connectivity over Internet with cloud (e.g., with server(s) and/or other network devices), with other vehicles, and/or with computing devices (e.g., client devices of passengers). In at least one embodiment, to communicate with other vehicles, a direct link may be established between vehicle 1000 and other vehicle and/or an indirect link may be established (e.g., across networks and over Internet). In at least one embodiment, direct links may be provided using a vehicle-to-vehicle communication link. In at least one embodiment, a vehicle-to-vehicle communication link may provide vehicle 1000 information about vehicles in proximity to vehicle 1000 (e.g., vehicles in front of, on side of, and/or behind vehicle 1000). In at least one embodiment, aforementioned functionality may be part of a cooperative adaptive cruise control functionality of vehicle 1000.

In at least one embodiment, network interface 1024 may include an SoC that provides modulation and demodulation functionality and enables controller(s) 1036 to communicate over wireless networks. In at least one embodiment, network interface 1024 may include a radio frequency front-end for up-conversion from baseband to radio frequency, and down-conversion from radio frequency to baseband. In at least one embodiment, frequency conversions may be performed in any technically feasible fashion. For example, frequency conversions could be performed through well-known processes, and/or using super-heterodyne processes. In at least one embodiment, radio frequency front end functionality may be provided by a separate chip. In at least one embodiment, network interface may include wireless functionality for communicating over LTE, WCDMA, UMTS, GSM, CDMA2000, Bluetooth, Bluetooth LE, Wi-Fi, Z-Wave, ZigBee, LoRaWAN, and/or other wireless protocols.

In at least one embodiment, vehicle 1000 may further include data store(s) 1028 which may include, without limitation, off-chip (e.g., off SoC(s) 1004) storage. In at least one embodiment, data store(s) 1028 may include, without limitation, one or more storage elements including RAM, SRAM, dynamic random-access memory (“DRAM”), video random-access memory (“VRAM”), Flash, hard disks, and/or other components and/or devices that may store at least one bit of data.

In at least one embodiment, vehicle 1000 may further include GNSS sensor(s) 1058 (e.g., GPS and/or assisted GPS sensors), to assist in mapping, perception, occupancy grid generation, and/or path planning functions. In at least one embodiment, any number of GNSS sensor(s) 1058 may be used, including, for example and without limitation, a GPS using a USB connector with an Ethernet to Serial (e.g., RS-232) bridge.

In at least one embodiment, vehicle 1000 may further include RADAR sensor(s) 1060. RADAR sensor(s) 1060 may be used by vehicle 1000 for long-range vehicle detection, even in darkness and/or severe weather conditions. In at least one embodiment, RADAR functional safety levels may be ASIL B. RADAR sensor(s) 1060 may use CAN and/or bus 1002 (e.g., to transmit data generated by RADAR sensor(s) 1060) for control and to access object tracking data, with access to Ethernet to access raw data in some examples. In at least one embodiment, a wide variety of RADAR sensor types may be used. For example, and without limitation, RADAR sensor(s) 1060 may be suitable for front, rear, and side RADAR use. In at least one embodiment, one or more of RADAR sensor(s) 1060 are Pulse Doppler RADAR sensor(s).

In at least one embodiment, RADAR sensor(s) 1060 may include different configurations, such as long-range with narrow field of view, short-range with wide field of view, short-range side coverage, etc. In at least one embodiment, long-range RADAR may be used for adaptive cruise control functionality. In at least one embodiment, long-range RADAR systems may provide a broad field of view realized by two or more independent scans, such as within a 250 m range. In at least one embodiment, RADAR sensor(s) 1060 may help in distinguishing between static and moving objects, and may be used by ADAS system 1038 for emergency brake assist and forward collision warning. Sensor(s) 1060 included in a long-range RADAR system may include, without limitation, monostatic multimodal RADAR with multiple (e.g., six or more) fixed RADAR antennae and a high-speed CAN and FlexRay interface. In at least one embodiment, with six antennae, central four antennae may create a focused beam pattern, designed to record vehicle 1000's surroundings at higher speeds with minimal interference from traffic in adjacent lanes. In at least one embodiment, other two antennae may expand field of view, making it possible to quickly detect vehicles entering or leaving vehicle 1000's lane.

In at least one embodiment, mid-range RADAR systems may include, as an example, a range of up to 160 m (front) or 80 m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). In at least one embodiment, short-range RADAR systems may include, without limitation, any number of RADAR sensor(s) 1060 designed to be installed at both ends of rear bumper. When installed at both ends of rear bumper, in at least one embodiment, a RADAR sensor system may create two beams that constantly monitor blind spot in rear and next to vehicle. In at least one embodiment, short-range RADAR systems may be used in ADAS system 1038 for blind spot detection and/or lane change assist.

In at least one embodiment, vehicle 1000 may further include ultrasonic sensor(s) 1062. Ultrasonic sensor(s) 1062, which may be positioned at front, back, and/or sides of vehicle 1000, may be used for park assist and/or to create and update an occupancy grid. In at least one embodiment, a wide variety of ultrasonic sensor(s) 1062 may be used, and different ultrasonic sensor(s) 1062 may be used for different ranges of detection (e.g., 2.5 m, 4 m). In at least one embodiment, ultrasonic sensor(s) 1062 may operate at functional safety levels of ASIL B.

In at least one embodiment, vehicle 1000 may include LIDAR sensor(s) 1064. LIDAR sensor(s) 1064 may be used for object and pedestrian detection, emergency braking, collision avoidance, and/or other functions. In at least one embodiment, LIDAR sensor(s) 1064 may be functional safety level ASIL B. In at least one embodiment, vehicle 1000 may include multiple LIDAR sensors 1064 (e.g., two, four, six, etc.) that may use Ethernet (e.g., to provide data to a Gigabit Ethernet switch).

In at least one embodiment, LIDAR sensor(s) 1064 may be capable of providing a list of objects and their distances for a 360-degree field of view. In at least one embodiment, commercially available LIDAR sensor(s) 1064 may have an advertised range of approximately 100 m, with an accuracy of 2 cm-3 cm, and with support for a 100 Mbps Ethernet connection, for example. In at least one embodiment, one or more non-protruding LIDAR sensors 1064 may be used. In such an embodiment, LIDAR sensor(s) 1064 may be implemented as a small device that may be embedded into front, rear, sides, and/or corners of vehicle 1000. In at least one embodiment, LIDAR sensor(s) 1064, in such an embodiment, may provide up to a 120-degree horizontal and 35-degree vertical field-of-view, with a 200 m range even for low-reflectivity objects. In at least one embodiment, front-mounted LIDAR sensor(s) 1064 may be configured for a horizontal field of view between 45 degrees and 135 degrees.

In at least one embodiment, LIDAR technologies, such as 3D flash LIDAR, may also be used. 3D flash LIDAR uses a flash of a laser as a transmission source to illuminate surroundings of vehicle 1000 up to approximately 200 m. In at least one embodiment, a flash LIDAR unit includes, without limitation, a receptor, which records laser pulse transit time and reflected light on each pixel, which in turn corresponds to range from vehicle 1000 to objects. In at least one embodiment, flash LIDAR may allow for highly accurate and distortion-free images of surroundings to be generated with every laser flash. In at least one embodiment, four flash LIDAR sensors may be deployed, one at each side of vehicle 1000. In at least one embodiment, 3D flash LIDAR systems include, without limitation, a solid-state 3D staring array LIDAR camera with no moving parts other than a fan (e.g., a non-scanning LIDAR device). In at least one embodiment, flash LIDAR device(s) may use a 5 nanosecond class I (eye-safe) laser pulse per frame and may capture reflected laser light in form of 3D range point clouds and co-registered intensity data.
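
The per-pixel range follows from the recorded transit time alone: light covers the out-and-back path, so range = c·t/2. A minimal sketch:

```cuda
#include <cstdio>

// Flash LIDAR range from pulse transit time: the laser travels out and back,
// so range = c * t / 2.
float tof_range_m(float transit_s) {
    const float c = 299792458.0f;  // speed of light, m/s
    return c * transit_s * 0.5f;
}

int main() {
    // A ~1.33 microsecond round trip corresponds to roughly 200 m.
    printf("%.1f m\n", tof_range_m(1.33e-6f));  // ~199.4 m
    return 0;
}
```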

In at least one embodiment, vehicle may further include IMU sensor(s) 1066. In at least one embodiment, IMU sensor(s) 1066 may be located at a center of rear axle of vehicle 1000. In at least one embodiment, IMU sensor(s) 1066 may include, for example and without limitation, accelerometer(s), magnetometer(s), gyroscope(s), magnetic compass(es), and/or other sensor types. In at least one embodiment, such as in six-axis applications, IMU sensor(s) 1066 may include, without limitation, accelerometers and gyroscopes. In at least one embodiment, such as in nine-axis applications, IMU sensor(s) 1066 may include, without limitation, accelerometers, gyroscopes, and magnetometers.

In at least one embodiment, IMU sensor(s) 1066 may be implemented as a miniature, high performance GPS-Aided Inertial Navigation System (“GPS/INS”) that combines micro-electro-mechanical systems (“MEMS”) inertial sensors, a high-sensitivity GPS receiver, and advanced Kalman filtering algorithms to provide estimates of position, velocity, and attitude. In at least one embodiment, IMU sensor(s) 1066 may enable vehicle 1000 to estimate heading without requiring input from a magnetic sensor by directly observing and correlating changes in velocity from GPS to IMU sensor(s) 1066. In at least one embodiment, IMU sensor(s) 1066 and GNSS sensor(s) 1058 may be combined in a single integrated unit.
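
For intuition, a drastically simplified one-dimensional Kalman filter fusing an IMU-propagated position with a noisy GPS fix might look as follows; a real GPS/INS estimates full position, velocity, and attitude states:

```cuda
#include <cstdio>

// Toy 1D Kalman filter: predict with IMU-derived velocity, correct with GPS.
struct Kalman1D {
    float x;  // position estimate
    float p;  // estimate variance

    // Predict: integrate velocity from the IMU; process noise q grows variance.
    void predict(float velocity, float dt, float q) {
        x += velocity * dt;
        p += q;
    }
    // Update: blend in a GPS position measurement with variance r.
    void update(float gps_pos, float r) {
        float k = p / (p + r);  // Kalman gain
        x += k * (gps_pos - x);
        p *= (1.0f - k);
    }
};

int main() {
    Kalman1D kf{0.0f, 1.0f};
    kf.predict(/*velocity=*/10.0f, /*dt=*/0.1f, /*q=*/0.01f);
    kf.update(/*gps_pos=*/1.2f, /*r=*/0.5f);
    printf("x=%.3f p=%.3f\n", kf.x, kf.p);
    return 0;
}
```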

In at least one embodiment, vehicle 1000 may include microphone(s) 1096 placed in and/or around vehicle 1000. In at least one embodiment, microphone(s) 1096 may be used for emergency vehicle detection and identification, among other things.

In at least one embodiment, vehicle 1000 may further include any number of camera types, including stereo camera(s) 1068, wide-view camera(s) 1070, infrared camera(s) 1072, surround camera(s) 1074, long-range camera(s) 1098, mid-range camera(s) 1076, and/or other camera types. In at least one embodiment, cameras may be used to capture image data around an entire periphery of vehicle 1000. In at least one embodiment, types of cameras used depend on vehicle 1000. In at least one embodiment, any combination of camera types may be used to provide necessary coverage around vehicle 1000. In at least one embodiment, number of cameras may differ depending on embodiment. For example, in at least one embodiment, vehicle 1000 could include six cameras, seven cameras, ten cameras, twelve cameras, or another number of cameras. Cameras may support, as an example and without limitation, Gigabit Multimedia Serial Link (“GMSL”) and/or Gigabit Ethernet. In at least one embodiment, each of camera(s) is described with more detail previously herein with respect to FIG. 10A and FIG. 10B.

In at least one embodiment, vehicle 1000 may further include vibration sensor(s) 1042. In at least one embodiment, vibration sensor(s) 1042 may measure vibrations of components of vehicle 1000, such as axle(s). For example, in at least one embodiment, changes in vibrations may indicate a change in road surfaces. In at least one embodiment, when two or more vibration sensors 1042 are used, differences between vibrations may be used to determine friction or slippage of road surface (e.g., when difference in vibration is between a power-driven axle and a freely rotating axle).

In at least one embodiment, vehicle 1000 may include ADAS system 1038. ADAS system 1038 may include, without limitation, an SoC, in some examples. In at least one embodiment, ADAS system 1038 may include, without limitation, any number and combination of an autonomous/adaptive/automatic cruise control (“ACC”) system, a cooperative adaptive cruise control (“CACC”) system, a forward crash warning (“FCW”) system, an automatic emergency braking (“AEB”) system, a lane departure warning (“LDW”) system, a lane keep assist (“LKA”) system, a blind spot warning (“BSW”) system, a rear cross-traffic warning (“RCTW”) system, a collision warning (“CW”) system, a lane centering (“LC”) system, and/or other systems, features, and/or functionality.

In at least one embodiment, ACC system may use RADAR sensor(s) 1060, LIDAR sensor(s) 1064, and/or any number of camera(s). In at least one embodiment, ACC system may include a longitudinal ACC system and/or a lateral ACC system. In at least one embodiment, longitudinal ACC system monitors and controls distance to vehicle immediately ahead of vehicle 1000 and automatically adjusts speed of vehicle 1000 to maintain a safe distance from vehicles ahead. In at least one embodiment, lateral ACC system performs distance keeping, and advises vehicle 1000 to change lanes when necessary. In at least one embodiment, lateral ACC is related to other ADAS applications such as LC and CW.
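As a non-limiting illustration (not part of the disclosed embodiments), the longitudinal distance-keeping behavior could be approximated by a simple proportional rule; the gain, speed cap, and names below are hypothetical.

```cpp
#include <algorithm>

// Hypothetical sketch: a proportional longitudinal ACC rule that nudges
// target speed toward holding a desired gap to the vehicle ahead. The
// gain and speed cap are arbitrary placeholders.
double accTargetSpeed(double currentSpeed_mps, double gap_m,
                      double desiredGap_m) {
    const double kGain = 0.5;             // proportional gain on gap error
    double error = gap_m - desiredGap_m;  // positive means too far behind
    double target = currentSpeed_mps + kGain * error;
    return std::clamp(target, 0.0, 33.0); // cap near 120 km/h
}
```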

In at least one embodiment, CACC system uses information from other vehicles that may be received via network interface 1024 and/or wireless antenna(s) 1026 via a wireless link, or indirectly, over a network connection (e.g., over Internet). In at least one embodiment, direct links may be provided by a vehicle-to-vehicle (“V2V”) communication link, while indirect links may be provided by an infrastructure-to-vehicle (“I2V”) communication link. In general, V2V communication concept provides information about immediately preceding vehicles (e.g., vehicles immediately ahead of and in same lane as vehicle 1000), while I2V communication concept provides information about traffic further ahead. In at least one embodiment, CACC system may include either or both I2V and V2V information sources. In at least one embodiment, given information of vehicles ahead of vehicle 1000, CACC system may be more reliable, and it has potential to improve traffic flow smoothness and reduce congestion on road.

In at least one embodiment, FCW system is designed to alert driver to a hazard, so that driver may take corrective action. In at least one embodiment, FCW system uses a front-facing camera and/or RADAR sensor(s) 1060, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component. In at least one embodiment, FCW system may provide a warning, such as in form of a sound, visual warning, vibration, and/or a quick brake pulse.

In at least one embodiment, AEB system detects an impending forward collision with another vehicle or other object, and may automatically apply brakes if driver does not take corrective action within a specified time or distance parameter. In at least one embodiment, AEB system may use front-facing camera(s) and/or RADAR sensor(s) 1060, coupled to a dedicated processor, DSP, FPGA, and/or ASIC. In at least one embodiment, when AEB system detects a hazard, AEB system typically first alerts driver to take corrective action to avoid collision and, if driver does not take corrective action, AEB system may automatically apply brakes in an effort to prevent, or at least mitigate, impact of predicted collision. In at least one embodiment, AEB system may include techniques such as dynamic brake support and/or crash imminent braking.
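As a non-limiting illustration (not part of the disclosed embodiments), the warn-then-brake escalation could be sketched with a time-to-collision rule; all thresholds and names below are hypothetical.

```cpp
// Hypothetical sketch: a time-to-collision (TTC) escalation policy in the
// spirit of AEB: warn first, brake only if driver does not act in time.
// All thresholds are arbitrary placeholders.
enum class AebAction { None, WarnDriver, ApplyBrakes };

AebAction aebDecision(double range_m, double closingSpeed_mps,
                      bool driverResponded) {
    if (closingSpeed_mps <= 0.0) return AebAction::None;  // gap is opening
    double ttc_s = range_m / closingSpeed_mps;            // seconds to impact
    if (ttc_s > 2.5) return AebAction::None;              // no hazard yet
    if (ttc_s > 1.2 || driverResponded) return AebAction::WarnDriver;
    return AebAction::ApplyBrakes;  // imminent and no corrective action taken
}
```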

In at least one embodiment, LDW system provides visual, audible, and/or tactile warnings, such as steering wheel or seat vibrations, to alert driver when vehicle 1000 crosses lane markings. In at least one embodiment, LDW system does not activate when driver indicates an intentional lane departure by activating a turn signal. In at least one embodiment, LDW system may use front-side facing cameras, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component. In at least one embodiment, LKA system is a variation of LDW system. LKA system provides steering input or braking to correct vehicle 1000 if vehicle 1000 starts to exit lane.

In at least one embodiment, BSW system detects and warns driver of vehicles in an automobile's blind spot. In at least one embodiment, BSW system may provide a visual, audible, and/or tactile alert to indicate that merging or changing lanes is unsafe. In at least one embodiment, BSW system may provide an additional warning when driver uses a turn signal. In at least one embodiment, BSW system may use rear-side facing camera(s) and/or RADAR sensor(s) 1060, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component.

In at least one embodiment, RCTW system may provide visual, audible, and/or tactile notification when an object is detected outside rear-camera range when vehicle 1000 is backing up. In at least one embodiment, RCTW system includes AEB system to ensure that vehicle brakes are applied to avoid a crash. In at least one embodiment, RCTW system may use one or more rear-facing RADAR sensor(s) 1060, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component.

In at least one embodiment, conventional ADAS systems may be prone to false positive results, which may be annoying and distracting to a driver, but typically are not catastrophic, because conventional ADAS systems alert driver and allow driver to decide whether a safety condition truly exists and act accordingly. In at least one embodiment, vehicle 1000 itself decides, in case of conflicting results, whether to heed result from a primary computer or a secondary computer (e.g., first controller 1036 or second controller 1036). For example, in at least one embodiment, ADAS system 1038 may be a backup and/or secondary computer for providing perception information to a backup computer rationality module. In at least one embodiment, backup computer rationality monitor may run redundant diverse software on hardware components to detect faults in perception and dynamic driving tasks. In at least one embodiment, outputs from ADAS system 1038 may be provided to a supervisory MCU. In at least one embodiment, if outputs from primary computer and secondary computer conflict, supervisory MCU determines how to reconcile conflict to ensure safe operation.

In at least one embodiment, primary computer may be configured to provide supervisory MCU with a confidence score, indicating primary computer's confidence in chosen result. In at least one embodiment, if confidence score exceeds a threshold, supervisory MCU may follow primary computer's direction, regardless of whether secondary computer provides a conflicting or inconsistent result. In at least one embodiment, where confidence score does not meet threshold, and where primary and secondary computer indicate different results (e.g., a conflict), supervisory MCU may arbitrate between computers to determine appropriate outcome.
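As a non-limiting illustration (not part of the disclosed embodiments), the threshold-based arbitration described above might be sketched as follows; the threshold value, names, and fallback policy are hypothetical.

```cpp
#include <string>

// Hypothetical sketch: supervisory arbitration between primary and
// secondary computer outputs using the primary computer's confidence
// score. The 0.9 threshold is a placeholder.
struct Result { std::string decision; };

Result superviseOutputs(const Result& primary, const Result& secondary,
                        double primaryConfidence) {
    const double kThreshold = 0.9;
    if (primaryConfidence >= kThreshold)
        return primary;  // follow primary regardless of any disagreement
    if (primary.decision == secondary.decision)
        return primary;  // no conflict to reconcile
    // Below threshold and conflicting: arbitrate; here a conservative
    // placeholder policy simply defers to the secondary output.
    return secondary;
}
```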

In at least one embodiment, supervisory MCU may be configured to run a neural network(s) that is trained and configured to determine, based at least in part on outputs from primary computer and secondary computer, conditions under which secondary computer provides false alarms. In at least one embodiment, neural network(s) in supervisory MCU may learn when secondary computer's output may be trusted, and when it cannot. For example, in at least one embodiment, when secondary computer is a RADAR-based FCW system, a neural network(s) in supervisory MCU may learn when FCW system is identifying metallic objects that are not, in fact, hazards, such as a drainage grate or manhole cover that triggers an alarm. In at least one embodiment, when secondary computer is a camera-based LDW system, a neural network in supervisory MCU may learn to override LDW when bicyclists or pedestrians are present and a lane departure is, in fact, safest maneuver. In at least one embodiment, supervisory MCU may include at least one of a DLA or GPU suitable for running neural network(s) with associated memory. In at least one embodiment, supervisory MCU may comprise and/or be included as a component of SoC(s) 1004.

In at least one embodiment, ADAS system 1038 may include a secondary computer that performs ADAS functionality using traditional rules of computer vision. In at least one embodiment, secondary computer may use classic computer vision rules (if-then), and presence of a neural network(s) in supervisory MCU may improve reliability, safety, and performance. For example, in at least one embodiment, diverse implementation and intentional non-identity make overall system more fault-tolerant, especially to faults caused by software (or software-hardware interface) functionality. For example, in at least one embodiment, if there is a software bug or error in software running on primary computer, and non-identical software code running on secondary computer provides same overall result, then supervisory MCU may have greater confidence that overall result is correct, and bug in software or hardware on primary computer is not causing material error.

In at least one embodiment, output of ADAS system 1038 may be fed into primary computer's perception block and/or primary computer's dynamic driving task block. For example, in at least one embodiment, if ADAS system 1038 indicates a forward crash warning due to an object immediately ahead, perception block may use this information when identifying objects. In at least one embodiment, secondary computer may have its own neural network which is trained and thus reduces risk of false positives, as described herein.

In at least one embodiment, vehicle 1000 may further include infotainment SoC 1030 (e.g., an in-vehicle infotainment system (IVI)). Although illustrated and described as an SoC, infotainment system 1030, in at least one embodiment, may not be an SoC, and may include, without limitation, two or more discrete components. In at least one embodiment, infotainment SoC 1030 may include, without limitation, a combination of hardware and software that may be used to provide audio (e.g., music, a personal digital assistant, navigational instructions, news, radio, etc.), video (e.g., TV, movies, streaming, etc.), phone (e.g., hands-free calling), network connectivity (e.g., LTE, WiFi, etc.), and/or information services (e.g., navigation systems, rear-parking assistance, a radio data system, vehicle related information such as fuel level, total distance covered, brake fluid level, oil level, door open/close, air filter information, etc.) to vehicle 1000. For example, infotainment SoC 1030 could include radios, disk players, navigation systems, video players, USB and Bluetooth connectivity, carputers, in-car entertainment, WiFi, steering wheel audio controls, hands-free voice control, a heads-up display (“HUD”), HMI display 1034, a telematics device, a control panel (e.g., for controlling and/or interacting with various components, features, and/or systems), and/or other components. In at least one embodiment, infotainment SoC 1030 may further be used to provide information (e.g., visual and/or audible) to user(s) of vehicle, such as information from ADAS system 1038, autonomous driving information such as planned vehicle maneuvers, trajectories, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.), and/or other information.

In at least one embodiment, infotainment SoC 1030 may include any amount and type of GPU functionality. In at least one embodiment, infotainment SoC 1030 may communicate over bus 1002 (e.g., CAN bus, Ethernet, etc.) with other devices, systems, and/or components of vehicle 1000. In at least one embodiment, infotainment SoC 1030 may be coupled to a supervisory MCU such that GPU of infotainment system may perform some self-driving functions in event that primary controller(s) 1036 (e.g., primary and/or backup computers of vehicle 1000) fail. In at least one embodiment, infotainment SoC 1030 may put vehicle 1000 into a chauffeur to safe stop mode, as described herein.

In at least one embodiment, vehicle 1000 may further include instrument cluster 1032 (e.g., a digital dash, an electronic instrument cluster, a digital instrument panel, etc.). In at least one embodiment, instrument cluster 1032 may include, without limitation, a controller and/or supercomputer (e.g., a discrete controller or supercomputer). In at least one embodiment, instrument cluster 1032 may include, without limitation, any number and combination of a set of instrumentation such as a speedometer, fuel level, oil pressure, tachometer, odometer, turn indicators, gearshift position indicator, seat belt warning light(s), parking-brake warning light(s), engine-malfunction light(s), supplemental restraint system (e.g., airbag) information, lighting controls, safety system controls, navigation information, etc. In some examples, information may be displayed and/or shared among infotainment SoC 1030 and instrument cluster 1032. In at least one embodiment, instrument cluster 1032 may be included as part of infotainment SoC 1030, or vice versa.

Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in system FIG. 10C for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

Such components can be used to generate synthetic data imitating failure cases in a network training process, which can help to improve performance of the network while limiting the amount of synthetic data to avoid overfitting.

FIG. 10D is a diagram of a system 1076 for communication between cloud-based server(s) and autonomous vehicle 1000 of FIG. 10A, according to at least one embodiment. In at least one embodiment, system 1076 may include, without limitation, server(s) 1078, network(s) 1090, and any number and type of vehicles, including vehicle 1000. In at least one embodiment, server(s) 1078 may include, without limitation, a plurality of GPUs 1084(A)-1084(H) (collectively referred to herein as GPUs 1084), PCIe switches 1082(A)-1082(D) (collectively referred to herein as PCIe switches 1082), and/or CPUs 1080(A)-1080(B) (collectively referred to herein as CPUs 1080). GPUs 1084, CPUs 1080, and PCIe switches 1082 may be interconnected with high-speed interconnects such as, for example and without limitation, NVLink interfaces 1088 developed by NVIDIA and/or PCIe connections 1086. In at least one embodiment, GPUs 1084 are connected via an NVLink and/or NVSwitch SoC, and GPUs 1084 and PCIe switches 1082 are connected via PCIe interconnects. In at least one embodiment, although eight GPUs 1084, two CPUs 1080, and four PCIe switches 1082 are illustrated, this is not intended to be limiting. In at least one embodiment, each of server(s) 1078 may include, without limitation, any number of GPUs 1084, CPUs 1080, and/or PCIe switches 1082, in any combination. For example, in at least one embodiment, server(s) 1078 could each include eight, sixteen, thirty-two, and/or more GPUs 1084.

In at least one embodiment, server(s) 1078 may receive, over network(s) 1090 and from vehicles, image data representative of images showing unexpected or changed road conditions, such as recently commenced road-work. In at least one embodiment, server(s) 1078 may transmit, over network(s) 1090 and to vehicles, neural networks 1092, updated neural networks 1092, and/or map information 1094, including, without limitation, information regarding traffic and road conditions. In at least one embodiment, updates to map information 1094 may include, without limitation, updates for HD map 1022, such as information regarding construction sites, potholes, detours, flooding, and/or other obstructions. In at least one embodiment, neural networks 1092, updated neural networks 1092, and/or map information 1094 may have resulted from new training and/or experiences represented in data received from any number of vehicles in environment, and/or based at least in part on training performed at a data center (e.g., using server(s) 1078 and/or other servers).

In at least one embodiment, server(s) 1078 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine). In at least one embodiment, any amount of training data is tagged (e.g., where associated neural network benefits from supervised learning) and/or undergoes other pre-processing. In at least one embodiment, any amount of training data is not tagged and/or pre-processed (e.g., where associated neural network does not require supervised learning). In at least one embodiment, once machine learning models are trained, machine learning models may be used by vehicles (e.g., transmitted to vehicles over network(s) 1090), and/or machine learning models may be used by server(s) 1078 to remotely monitor vehicles.

In at least one embodiment, server(s) 1078 may receive data from vehicles and apply data to up-to-date real-time neural networks for real-time intelligent inferencing. In at least one embodiment, server(s) 1078 may include deep-learning supercomputers and/or dedicated AI computers powered by GPU(s) 1084, such as DGX and DGX Station machines developed by NVIDIA. However, in at least one embodiment, server(s) 1078 may include deep learning infrastructure that uses CPU-powered data centers.

In at least one embodiment, deep-learning infrastructure of server(s) 1078 may be capable of fast, real-time inferencing, and may use that capability to evaluate and verify health of processors, software, and/or associated hardware in vehicle 1000. For example, in at least one embodiment, deep-learning infrastructure may receive periodic updates from vehicle 1000, such as a sequence of images and/or objects that vehicle 1000 has located in that sequence of images (e.g., via computer vision and/or other machine learning object classification techniques). In at least one embodiment, deep-learning infrastructure may run its own neural network to identify objects and compare them with objects identified by vehicle 1000 and, if results do not match and deep-learning infrastructure concludes that AI in vehicle 1000 is malfunctioning, then server(s) 1078 may transmit a signal to vehicle 1000 instructing a fail-safe computer of vehicle 1000 to assume control, notify passengers, and complete a safe parking maneuver.
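As a non-limiting illustration (not part of the disclosed embodiments), the server-side health check could be sketched as a comparison of detection sets; the names and the exact-agreement rule are hypothetical.

```cpp
#include <set>
#include <string>

// Hypothetical sketch: server-side health check comparing objects the
// vehicle reports against objects found by the server's own inference
// pass; a mismatch could trigger the fail-safe signal described above.
bool aiAppearsHealthy(const std::set<std::string>& vehicleObjects,
                      const std::set<std::string>& serverObjects) {
    // Placeholder rule: healthy only if the two detection sets agree.
    return vehicleObjects == serverObjects;
}
```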

In at least one embodiment, server(s) 1078 may include GPU(s) 1084 and one or more programmable inference accelerators (e.g., NVIDIA's TensorRT 3). In at least one embodiment, combination of GPU-powered servers and inference acceleration may make real-time responsiveness possible. In at least one embodiment, such as where performance is less critical, servers powered by CPUs, FPGAs, and other processors may be used for inferencing.

Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in system FIG. 10D for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

Computer Systems

FIG. 11 illustrates a computer system 1100, according to at least one embodiment. In at least one embodiment, computer system 1100 is configured to implement various processes and methods described throughout this disclosure. Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network via network interface 1122. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in system FIG. 11 for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

In at least one embodiment, computer system 1100 comprises, without limitation, at least one central processing unit (“CPU”) 1102 that is connected to a communication bus 1110 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), peripheral component interconnect express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol(s). In at least one embodiment, computer system 1100 includes, without limitation, a main memory 1104, and control logic (e.g., implemented as hardware, software, or a combination thereof) and data are stored in main memory 1104, which may take form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 1122 provides an interface to other computing devices and networks for receiving data from and transmitting data to other systems with computer system 1100.

In at least one embodiment, computer system 1100 includes, without limitation, input devices 1108, a parallel processing system 1112, and display devices 1106 that can be implemented using a conventional cathode ray tube (“CRT”), a liquid crystal display (“LCD”), a light emitting diode (“LED”) display, a plasma display, or other suitable display technologies. In at least one embodiment, user input is received from input devices 1108 such as keyboard, mouse, touchpad, microphone, etc. In at least one embodiment, each module described herein can be situated on a single semiconductor platform to form a processing system.

Communication engine and coalescing agent logic 904 are used to perform automatic coalescing of communication requests over a network. Details regarding communication engine and coalescing agent logic 904 are provided above with respect to FIGS. 1-8. In at least one embodiment, communication engine and coalescing agent logic 904 may be used in system FIG. 11 for automatic coalescing operations for computing systems using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
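As a non-limiting illustration (not part of the disclosed embodiments and not the patented implementation), the coalescing flow recited in the clauses below can be sketched as follows: requests that fail the coalescing criterion take a direct path, while qualifying requests are grouped by a common property and flushed on a group-size or timer trigger. The class name, thresholds, and transport stubs are hypothetical.

```cpp
#include <chrono>
#include <map>
#include <utility>
#include <vector>

// Hypothetical sketch: a coalescing agent that groups small requests by a
// common property (destination GPU plus operation type) and flushes a group
// when it reaches a size threshold or when its timer expires. All names,
// limits, and transport stubs are placeholders.
struct Request {
    int destGpu;
    int opType;               // e.g., put/get/atomic
    std::size_t payloadBytes;
};

class CoalescingAgent {
    using Clock = std::chrono::steady_clock;
    using Key = std::pair<int, int>;  // (destination GPU, operation type)
    static constexpr std::size_t kGroupSize = 32;             // placeholder
    static constexpr std::chrono::milliseconds kTimeout{2};   // placeholder

public:
    void submit(const Request& r) {
        if (!satisfiesCoalescingCriterion(r)) {
            transportDirect(r);   // e.g., a P2P path for this request
            return;
        }
        Key key{r.destGpu, r.opType};
        auto& group = groups_[key];
        if (group.empty()) timerStart_[key] = Clock::now();  // start timer
        group.push_back(r);
        if (group.size() >= kGroupSize) flush(key);          // size trigger
    }

    // Called periodically: flush any group whose timer has expired.
    void poll() {
        for (auto& [key, group] : groups_)
            if (!group.empty() && Clock::now() - timerStart_[key] >= kTimeout)
                flush(key);
    }

private:
    bool satisfiesCoalescingCriterion(const Request& r) const {
        return r.payloadBytes < 512;  // placeholder request-size criterion
    }
    void flush(const Key& key) {
        auto& group = groups_[key];
        transportCoalesced(group);    // one network transfer for whole group
        group.clear();
    }
    void transportDirect(const Request&) { /* P2P transport stub */ }
    void transportCoalesced(const std::vector<Request>&) { /* network stub */ }

    std::map<Key, std::vector<Request>> groups_;
    std::map<Key, Clock::time_point> timerStart_;
};
```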

At least one embodiment of the disclosure can be described in view of the following clauses:

In clause 1, a method comprising: receiving, from a shared memory application executing on a first graphics processing unit (GPU), a first communication request having a second GPU as a destination; determining that the first communication request satisfies a coalescing criterion; storing the first communication request in association with a group of requests that have a common property; determining that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalescing the group of requests into a coalesced request; and transporting the coalesced request to the second GPU over a network.

In clause 2, the method of clause 1, further comprising: receiving, from the shared memory application, a second communication request, wherein the first communication request originates from a first group of threads of the first GPU, and wherein the second communication request originates from a second group of threads of the first GPU; determining that the second communication request satisfies the coalescing criterion; and storing the second communication request in association with the group of requests that have the common property.

In clause 3, the method of clause 1, further comprising: receiving, from the shared memory application, a second communication request having a third GPU as a destination; determining that the second communication request does not satisfy the coalescing criterion, wherein the second communication request is transportable via a peer-to-peer (P2P) connection with the third GPU; and transporting the second communication request to the third GPU over the P2P connection.

In clause 4, the method of clause 1, wherein determining that the first communication request satisfies the coalescing criterion comprises determining that the first communication request satisfies at least one of a request size criterion, a latency criterion, or a peer-to-peer (P2P) connectivity criterion.

In clause 5, the method of clause 1, wherein the common property is at least one of a same operation type, a same network destination, a same GPU destination, or adjacent memory locations.

In clause 6, the method of clause 1, further comprising: receiving, from the shared memory application, a second communication request, wherein the second communication request originates from a group of threads of the first GPU and has a third GPU as a destination; performing group-level coalescing of the second communication request with other communication requests from the group of threads to obtain a group-level request; determining that the group-level request satisfies the coalescing criterion; storing the group-level request in association with a second group of requests that have a common property; determining that a second timer associated with the second group of requests expires or a size of the second group satisfies the group size criterion; coalescing the second group of requests into a second coalesced request; and transporting the second coalesced request to the third GPU.

In clause 7, the method of clause 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a software communication engine implemented using the first GPU.

In clause 8, the method of clause 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a software communication engine implemented using a first kernel in the first GPU, wherein the shared memory application is executed using a second kernel in the first GPU.

In clause 9, the method of clause 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented as hardware logic using a hardware offload circuit coupled to the first GPU.

In clause 10, the method of clause 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented as a software communication engine using a central processing unit (CPU) operatively coupled to the first GPU.

In clause 11, the method of clause 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented using a software communication engine in a third GPU coupled to the first GPU.

In clause 12, a system comprising: a memory device; a central processing unit (CPU); and a first graphics processing unit (GPU) operatively coupled to the memory device and the CPU, the first GPU to execute a communication engine, wherein the communication engine is to: receive, from a shared memory application, a first communication request having a second GPU as a destination; determine that the first communication request satisfies a coalescing criterion; store the first communication request in association with a group of requests that have a common property; determine that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalesce the group of requests into a coalesced request; and transport the coalesced request to the second GPU over a network.

In clause 13, the system of clause 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request, wherein the first communication request originates from a first group of threads of the first GPU, and wherein the second communication request originates from a second group of threads of the first GPU; determine that the second communication request satisfies the coalescing criterion; and store the second communication request in association with the group of requests that have the common property.

In clause 14, the system of clause 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request having a third GPU as a destination; determine that the second communication request does not satisfy the coalescing criterion, wherein the second communication request is transportable via a peer-to-peer (P2P) connection with the third GPU; and transport the second communication request to the third GPU over the P2P connection.

In clause 15, the system of clause 12, wherein the communication engine is to determine that the first communication request satisfies the coalescing criterion by determining that the first communication request satisfies at least one of a request size criterion, a latency criterion, or a peer-to-peer (P2P) connectivity criterion.

In clause 16, the system of clause 12, wherein the common property is at least one of a same operation type, a same network destination, a same GPU destination, or adjacent memory locations.

In clause 17, the system of clause 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request, wherein the second communication request originates from a group of threads of the first GPU and has a third GPU as a destination; perform group-level coalescing of the second communication request with other communication requests from the group of threads to obtain a group-level request; determine that the group-level request satisfies the coalescing criterion; store the group-level request in association with a second group of requests that have a common property; determine that a second timer associated with the second group of requests expires or a size of the second group satisfies the group size criterion; coalesce the second group of requests into a second coalesced request; and transport the second coalesced request to the third GPU.

In clause 18, a computing system comprising: a memory device; a first graphics processing unit (GPU) operatively coupled to the memory device, the first GPU comprising a communication engine, wherein the communication engine is to: receive, from a shared memory application, a first communication request having a second GPU as a destination; determine that the first communication request satisfies a coalescing criterion; store the first communication request in association with a group of requests that have a common property; determine that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalesce the group of requests into a coalesced request; and transport the coalesced request to the second GPU over a network.

In clause 19, the computing system of clause 18, wherein the communication engine is a hardware offload circuit coupled to the first GPU.

In clause 20, the computing system of clause 18, wherein the first GPU executes the shared memory application and the communication engine.

Other variations are within spirit of present disclosure. Thus, while disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in drawings and have been described above in detail. It should be understood, however, that there is no intention to limit disclosure to specific form or forms disclosed, but on contrary, intention is to cover all modifications, alternative constructions, and equivalents falling within spirit and scope of disclosure, as defined in appended claims.

Use of terms “a” and “an” and “the” and similar referents in context of describing disclosed embodiments (especially in context of following claims) is to be construed to cover both singular and plural, unless otherwise indicated herein or clearly contradicted by context, and not as a definition of a term. Terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (meaning “including, but not limited to,”) unless otherwise noted. “Connected,” when unmodified and referring to physical connections, is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within range, unless otherwise indicated herein, and each separate value is incorporated into specification as if it were individually recited herein. In at least one embodiment, use of term “set” (e.g., “a set of items”) or “subset,” unless otherwise noted or contradicted by context, is to be construed as a nonempty collection comprising one or more members. Further, unless otherwise noted or contradicted by context, term “subset” of a corresponding set does not necessarily denote a proper subset of corresponding set, but subset and corresponding set may be equal.

Conjunctive language, such as phrases of form “at least one of A, B, and C,” or “at least one of A, B and C,” unless specifically stated otherwise or otherwise clearly contradicted by context, is otherwise understood with context as used in general to present that an item, term, etc., may be either A or B or C, or any nonempty subset of set of A and B and C. For instance, in illustrative example of a set having three members, conjunctive phrases “at least one of A, B, and C” and “at least one of A, B and C” refer to any of following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present. In addition, unless otherwise noted or contradicted by context, term “plurality” indicates a state of being plural (e.g., “a plurality of items” indicates multiple items). In at least one embodiment, number of items in a plurality is at least two, but can be more when so indicated either explicitly or by context. Further, unless stated otherwise or otherwise clear from context, phrase “based on” means “based at least in part on” and not “based solely on.”

Operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. In at least one embodiment, a process such as those processes described herein (or variations and/or combinations thereof) is performed under control of one or more computer systems configured with executable instructions and is implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. In at least one embodiment, code is stored on a computer-readable storage medium, for example, in form of a computer program comprising a plurality of instructions executable by one or more processors. In at least one embodiment, a computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transitory signals (e.g., a propagating transient electric or electromagnetic transmission) but includes non-transitory data storage circuitry (e.g., buffers, cache, and queues) within transceivers of transitory signals. In at least one embodiment, code (e.g., executable code or source code) is stored on a set of one or more non-transitory computer-readable storage media having stored thereon executable instructions (or other memory to store executable instructions) that, when executed (i.e., as a result of being executed) by one or more processors of a computer system, cause computer system to perform operations described herein. In at least one embodiment, set of non-transitory computer-readable storage media comprises multiple non-transitory computer-readable storage media, and one or more of individual non-transitory storage media of multiple non-transitory computer-readable storage media lack all of code while multiple non-transitory computer-readable storage media collectively store all of code. In at least one embodiment, executable instructions are executed such that different instructions are executed by different processors—for example, a non-transitory computer-readable storage medium stores instructions and a main central processing unit (“CPU”) executes some of instructions while a graphics processing unit (“GPU”) and/or a data processing unit (“DPU”)—potentially in conjunction with a GPU—executes other instructions. In at least one embodiment, different components of a computer system have separate processors and different processors execute different subsets of instructions.

Accordingly, in at least one embodiment, computer systems are configured to implement one or more services that singly or collectively perform operations of processes described herein, and such computer systems are configured with applicable hardware and/or software that enable performance of operations. Further, a computer system that implements at least one embodiment of present disclosure is a single device and, in another embodiment, is a distributed computer system comprising multiple devices that operate differently such that distributed computer system performs operations described herein and such that a single device does not perform all operations.

Use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate embodiments of disclosure and does not pose a limitation on scope of disclosure unless otherwise claimed. No language in specification should be construed as indicating any non-claimed element as essential to practice of disclosure.

All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

In description and claims, terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms may not be intended as synonyms for each other. Rather, in particular examples, “connected” or “coupled” may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. “Coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.

Unless specifically stated otherwise, it may be appreciated that throughout specification terms such as “processing,” “computing,” “calculating,” “determining,” or like, refer to action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within computing system's registers and/or memories into other data similarly represented as physical quantities within computing system's memories, registers or other such information storage, transmission or display devices.

In a similar manner, term “processor” may refer to any device or portion of a device that processes electronic data from registers and/or memory and transforms that electronic data into other electronic data that may be stored in registers and/or memory. As non-limiting examples, “processor” may be a CPU or a GPU. A “computing platform” may comprise one or more processors. As used herein, “software” processes may include, for example, software and/or hardware entities that perform work over time, such as tasks, threads, and intelligent agents. Also, each process may refer to multiple processes, for carrying out instructions in sequence or in parallel, continuously or intermittently. In at least one embodiment, terms “system” and “method” are used herein interchangeably insofar as system may embody one or more methods and methods may be considered a system.

In present document, references may be made to obtaining, acquiring, receiving, or inputting analog or digital data into a subsystem, computer system, or computer-implemented machine. In at least one embodiment, process of obtaining, acquiring, receiving, or inputting analog and digital data can be accomplished in a variety of ways such as by receiving data as a parameter of a function call or a call to an application programming interface. In at least one embodiment, processes of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring data via a serial or parallel interface. In at least one embodiment, processes of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring data via a computer network from providing entity to acquiring entity. In at least one embodiment, references may also be made to providing, outputting, transmitting, sending, or presenting analog or digital data. In various examples, processes of providing, outputting, transmitting, sending, or presenting analog or digital data can be accomplished by transferring data as an input or output parameter of a function call, a parameter of an application programming interface, or interprocess communication mechanism.

Although descriptions herein set forth example embodiments of described techniques, other architectures may be used to implement described functionality, and are intended to be within scope of this disclosure. Furthermore, although specific distributions of responsibilities may be defined above for purposes of description, various functions and responsibilities might be distributed and divided in different ways, depending on circumstances.

Furthermore, although subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that subject matter claimed in appended claims is not necessarily limited to specific features or acts described. Rather, specific features and acts are disclosed as exemplary forms of implementing the claims.

What is claimed is:
1. A method comprising: receiving, from a shared memory application executing on a first graphics processing unit (GPU), a first communication request having a second GPU as a destination; determining whether the first communication request satisfies a coalescing criterion for network transport over a network to the second GPU; transporting, in response to determining the first communication request does not satisfy the coalescing criterion, the first communication request to the second GPU over a peer-to-peer (P2P) connection between the first and second GPUs; in response to determining the first communication request satisfies the coalescing criterion, storing the first communication request in association with a group of requests that have a common property; determining that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalescing the group of requests into a coalesced request; and transporting the coalesced request to the second GPU over the network.
2. The method of claim 1, further comprising: receiving, from the shared memory application, a second communication request, wherein the first communication request originates from a first group of threads of the first GPU, and wherein the second communication request originates from a second group of threads of the first GPU; determining that the second communication request satisfies the coalescing criterion; and storing the second communication request in association with the group of requests that have the common property.
3. The method of claim 1, further comprising: receiving, from the shared memory application, a second communication request having a third GPU as a destination; determining that the second communication request does not satisfy the coalescing criterion, wherein the second communication request is transportable via a P2P connection with the third GPU; and transporting the second communication request to the third GPU over the P2P connection.
4. The method of claim 1, wherein determining that the first communication request satisfies the coalescing criterion comprises determining that the first communication request satisfies at least one of a request size criterion, a latency criterion, or a P2P connectivity criterion.
5. The method of claim 1, wherein the common property is at least one of a same operation type, a same network destination, a same GPU destination, or adjacent memory locations.
6. The method of claim 1, further comprising: receiving, from the shared memory application, a second communication request, wherein the second communication request originates from a group of threads of the first GPU and has a third GPU as a destination; performing group-level coalescing of the second communication request with other communication requests from the group of threads to obtain a group-level request; determining that the group-level request satisfies the coalescing criterion; storing the group-level request in association with a second group of requests that have a common property; determining that a second timer associated with the second group of requests expires or a size of the second group satisfies the group size criterion; coalescing the second group of requests into a second coalesced request; and transporting the second coalesced request to the third GPU.
7. The method of claim 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a software communication engine implemented using the first GPU.
8. The method of claim 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a software communication engine implemented using a first kernel in the first GPU, wherein the shared memory application is executed using a second kernel in the first GPU.
9. The method of claim 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented as hardware logic using a hardware offload circuit coupled to the first GPU.
10. The method of claim 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented as a software communication engine using a central processing unit (CPU) operatively coupled to the first GPU.
11. The method of claim 1, wherein at least one of the receiving, the determining the first communication request satisfies a coalescing criterion, the storing, the determining a timer associated with the group of requests expires or a size of the group satisfies a group size criterion, the coalescing, or the transporting is executed by a communication engine implemented using a software communication engine in a third GPU coupled to the first GPU.

12. A system comprising: a memory device; a central processing unit (CPU); and a first graphics processing unit (GPU) operatively coupled to the memory device and the CPU, the first GPU to execute a communication engine, wherein the communication engine is to: receive, from a shared memory application, a first communication request having a second GPU as a destination; determine whether the first communication request satisfies a coalescing criterion for network transport over a network to the second GPU; transport, in response to determining the first communication request does not satisfy the coalescing criterion, the first communication request to the second GPU over a peer-to-peer (P2P) connection between the first and second GPUs; in response to determining the first communication request satisfies the coalescing criterion, store the first communication request in association with a group of requests that have a common property; determine that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalesce the group of requests into a coalesced request; and transport the coalesced request to the second GPU over the network.
13. The system of claim 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request, wherein the first communication request originates from a first group of threads of the first GPU, and wherein the second communication request originates from a second group of threads of the first GPU; determine that the second communication request satisfies the coalescing criterion; and store the second communication request in association with the group of requests that have the common property.
14. The system of claim 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request having a third GPU as a destination; determine that the second communication request does not satisfy the coalescing criterion, wherein the second communication request is transportable via the P2P connection with the third GPU; and transport the second communication request to the third GPU over the P2P connection.
15. The system of claim 12, wherein the communication engine is to determine that the first communication request satisfies the coalescing criterion by determining that the first communication request satisfies at least one of a request size criterion, a latency criterion, or a P2P connectivity criterion.
16. The system of claim 12, wherein the common property is at least one of a same operation type, a same network destination, a same GPU destination, or adjacent memory locations.

17. The system of claim 12, wherein the communication engine is further to: receive, from the shared memory application, a second communication request, wherein the second communication request originates from a group of threads of the first GPU and has a third GPU as a destination; perform group-level coalescing of the second communication request with other communication requests from the group of threads to obtain a group-level request; determine that the group-level request satisfies the coalescing criterion; store the group-level request in association with a second group of requests that have a common property; determine that a second timer associated with the second group of requests expires or a size of the second group satisfies the group size criterion; coalesce the second group of requests into a second coalesced request; and transport the second coalesced request to the third GPU.
18. A computing system comprising: a memory device; a first graphics processing unit (GPU) operatively coupled to the memory device, the first GPU comprising a communication engine, wherein the communication engine is to: receive, from a shared memory application, a first communication request having a second GPU as a destination; determine whether the first communication request satisfies a coalescing criterion for network transport over a network to the second GPU; transport, in response to determining the first communication request does not satisfy the coalescing criterion, the first communication request to the second GPU over a peer-to-peer (P2P) connection between the first and second GPUs; in response to determining the first communication request satisfies the coalescing criterion, store the first communication request in association with a group of requests that have a common property; determine that a timer associated with the group of requests expires or a size of the group satisfies a group size criterion; coalesce the group of requests into a coalesced request; and transport the coalesced request to the second GPU over the network.
19. The computing system of claim 18, wherein the communication engine is a hardware offload circuit coupled to the first GPU.
20. The computing system of claim 18, wherein the first GPU executes the shared memory application and the communication engine.